Abstract

The minimization of the logistic loss is a popular approach to batch supervised learning. Our 
paper starts from the surprising observation that, when fitting linear (or kernelized) classifiers, 
the minimization of the logistic loss is equivalent to the minimization of an exponential rado-loss
computed (i) over transformed data that we call Rademacher observations (rados), and (ii) over 
the same classifier as the one of the logistic loss. Thus, a classifier learnt from rados can be 
directly used to classify observations. We provide a learning algorithm over rados with boosting- 
compliant convergence rates on the logistic loss (computed over examples). Experiments on 
domains with up to millions of examples, backed up by theoretical arguments, display that 
learning over a small set of random rados can challenge the state of the art that learns over 
the complete set of examples. We show that rados comply with various privacy requirements 
that make them good candidates for machine learning in a privacy framework. We give several 
algebraic, geometric and computational hardness results on reconstructing examples from rados. 
We also show how it is possible to craft, and efficiently learn from, rados in a differential privacy
framework. Tests reveal that learning from differentially private rados can compete with learning 
from random rados, and hence with batch learning from examples, achieving non-trivial privacy 
vs accuracy tradeoffs. 


1 Introduction 

This paper deals with the following fundamental question: 

What information is sufficient for learning, and what guarantees can it bring that regular data 
cannot ? 

By “regular”, we mean the usual inputs provided to a learner. In our context of batch supervised 
learning, this is a training set of examples, each of which is an observation with a class, and learning 
means inducing in reduced time an accurate function from observations to classes, a classifier. It 
turns out that we do not need the detail of classes to learn a classifier (linear or kernelized): an 
aggregate, whose size is the dimension of the observation space, is minimally sufficient, the mean 
operator [2i] . 



But do we need examples ? 

This perhaps surprising and non-trivial question is becoming crucial now that the nature of 
stored and processed signals intelligence data is heavily debated in the public sphere [IHl |28] . In 
the context of machine learning (ML), the objective of being accurate is more and more frequently 
subsumed by more complex goals, sometimes involving challenging tradeoffs in which accuracy does 
not ultimately appear in the topmost requirements. Privacy is one such crucial goal [iniEKIS]. 
There are various models to capture the privacy requirement, such as secure multi-party computation
and differential privacy (DP, [E]). The former usually relies on cryptographic protocols, which
can be heavy even for bare classification and simple algorithms [1]. The latter usually relies on 
the power of randomization to ensure that any “local” change cannot be spotted from the output 
delivered [laiE]. In a ML setting, randomization can be performed at various stages, from the 
examples to the output of a classifier. We focus on the upstream stage of the process, i.e. the input 
to the learner, which grants the benefits that all subsequent stages also comply with differential 
privacy. Randomization has its power: it also has its limits in this case, as it may significantly 
degrade the performance of learners. 

The way we address this problem starts from a surprising observation, whose relevance to 
supervised ML goes beyond learning with private data: learning a linear (or kernelized) classifier 
over examples through the minimization of the expected logistic loss is equivalent to learning
the same classifier by minimizing an exponential loss over a complete set of transformed data that
we call Rademacher observations, rados. Each rado is the sum of edge vectors over a subset of
examples (edge = observation × label). We also show that efficient learning may still be achieved
when carried out over subsets of all possible rados.

This is our first contribution, and we expect it to be useful in several other areas of supervised 
learning. In the context of learning with private data, our other contributions can be summarized 
as showing how rados may yield new privacy guarantees — not limited to differential privacy — 
while authorising boosting-compliant rates for learning. More precisely, our second contribution is 
to propose a rado-based learning algorithm, which has boosting-compliant convergence rates over 
the logistic loss computed over the examples. Thus, we learn an accurate classifier over rados, and 
the same classifier is accurate over examples as well. 

The fact that efficient learning may be achieved through a subset of rados is interesting because it
opens the problem of designing this particular subset to address domain-specific requirements that 
add to the ML accuracy requirement. Among our other contributions, we provide one important 
design example, showing how to build differentially private mechanisms for rado delivery, such as 
when protecting specific sensitive features in data. Experiments confirm in this case that learning 
from differentially private rados may still be competitive with learning from examples. We provide 
another design which pairs to our rado-based boosting algorithm, with the crucial property that 
when examples have been DP-protected by the popular Gaussian mechanism [12], the joint pair
(rado delivery design, boosting algorithm) may achieve convergence rates comparable to the noise- 
free setting with high probability, even over strong DP protection regimes. Our last contribution is 
to show that rados may protect the privacy of the original examples not only in the DP framework, 
but also from several algebraic, geometric and even computational-complexity theoretic standpoints. 

The remainder of this paper is organized as follows. Section 2 presents Rademacher observations, shows the equivalence between learning from examples and learning from rados, and how learning from subsets of rados may be sufficient for efficient learning; Section 3 presents our rado-based boosting algorithm, and Section 4 presents experiments with this algorithm; Section 5 presents our results in DP models, and Section 6 presents related experiments; Section 7 provides results on the hardness of reconstructing examples from rados from algebraic, geometric and computational standpoints. To keep the paper readable, proofs and additional experiments are given in two separate appendices available in Section 10 (proofs) and Section 11 (experiments).


2 Rados and supervised learning 

Let $[n] = \{1, 2, \ldots, n\}$. We are given a set of $m$ examples $S = \{(x_i, y_i), i \in [m]\}$, where $x_i \in X \subseteq \mathbb{R}^d$ is an observation and $y_i \in \{-1, 1\}$ is a label, or class. $X$ is the domain. A linear classifier $\theta \in \Theta$ for some fixed $\Theta \subseteq \mathbb{R}^d$ gives a label to $x \in X$ equal to the sign of $\theta^\top x \in \mathbb{R}$. Our results can be lifted to kernels (at least with finite dimension feature maps) following standard arguments [2^. We let $\Sigma_m = \{-1, 1\}^m$.

Definition 1 For any $\sigma \in \Sigma_m$, the Rademacher observation $\pi_\sigma$ with signature $\sigma$ is $\pi_\sigma = (1/2) \cdot \sum_i (\sigma_i + y_i) x_i$.
The simplest way to randomly sample rados is to pick $\sigma$ as i.i.d. Rademacher variables, hence the name. Reference to $S$ is implicit in the definition of $\pi_\sigma$. A Rademacher observation sums edge vectors (the terms $y_i x_i$) over the subset of examples for which $y_i = \sigma_i$. When $\sigma = y$ is the vector of classes, $\pi_\sigma = m \mu_S$ is $m$ times the mean operator [26, 24]. When $\sigma = -y$, we get the null vector $\pi_\sigma = 0$. A popular approach to learn $\theta$ over $S$ is to minimize the surrogate risk $F_{\log}(S, \theta)$ built from the logistic loss (logloss):


$$F_{\log}(S, \theta) = \frac{1}{m} \sum_i \log\left(1 + \exp\left(-y_i \theta^\top x_i\right)\right) . \tag{1}$$

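We define the exponential rado-risk $F^r_{\exp}(S, \theta, U)$, computed on any $U \subseteq \Sigma_m$ with cardinal $|U| = n$, as:

$$F^r_{\exp}(S, \theta, U) = \frac{1}{n} \sum_{\sigma \in U} \exp\left(-\theta^\top \pi_\sigma\right) . \tag{2}$$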

It turns out that $F_{\log} = g(F^r_{\exp})$ for some continuous strictly increasing $g$; hence, minimizing one criterion is equivalent to minimizing the other and vice versa. This is stated formally in the following Lemma.


Lemma 2 The following holds true, for any $\theta$ and $S$:

$$F_{\log}(S, \theta) = \log(2) + \frac{1}{m} \log F^r_{\exp}(S, \theta, \Sigma_m) . \tag{3}$$

(Proof in the Appendix, Subsection 10.1). Lemma 2 shows that learning with examples via the minimization of $F_{\log}(S, \theta)$, and learning with all rados via the minimization of $F^r_{\exp}(S, \theta, \Sigma_m)$, are essentially equivalent tasks. Since the cardinal $|\Sigma_m| = 2^m$ is exponential, it is unrealistic, even on moderate-size samples, to pick that latter option. This raises however a very interesting question: if we replace $\Sigma_m$ by a subset $U$ of size $\ll 2^m$, what does the relationship between examples and rados in eq. (3) become? We answer this question under the setting that:

(i) instead of $\Sigma_m$, we consider a predefined $\Sigma_r \subseteq \Sigma_m$;

(ii) instead of considering $U = \Sigma_r$, we sample uniformly i.i.d. $U \sim \Sigma_r$ for $n \geq 1$ rados.

While (ii) is directly targeted at reducing the number of rados, (i) is an upper-level strategic design to tackle additional constraints, such as differential privacy. We now need the following definition of the logistic rado-risk:

$$F^r_{\log}(S, \theta, U) = \log(2) + \frac{1}{m} \log F^r_{\exp}(S, \theta, U) , \tag{4}$$

for any $U \subseteq \Sigma_m$, so that $F_{\log}(S, \theta) = F^r_{\log}(S, \theta, \Sigma_m)$. We also define the open ball $B(0, r) = \{x \in \mathbb{R}^d : \|x\|_2 < r\}$.
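As a quick sanity check of Lemma 2, the following minimal Python sketch enumerates all $2^m$ signatures on a tiny sample and compares both sides of eq. (3). The toy data, the classifier and every function name are illustrative assumptions, not taken from the paper's experiments; the rado formula used is the one described after Definition 1 (a rado sums the edge vectors $y_i x_i$ of the examples with $\sigma_i = y_i$).

```python
import itertools
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def rado(X, y, sigma):
    """Rademacher observation: (1/2) * sum_i (sigma_i + y_i) * x_i."""
    d = len(X[0])
    pi = [0.0] * d
    for x_i, y_i, s_i in zip(X, y, sigma):
        coeff = 0.5 * (s_i + y_i)  # equals y_i when sigma_i == y_i, and 0 otherwise
        for k in range(d):
            pi[k] += coeff * x_i[k]
    return pi


def logistic_risk(X, y, theta):
    """F_log(S, theta), eq. (1)."""
    return sum(math.log1p(math.exp(-y_i * dot(theta, x_i)))
               for x_i, y_i in zip(X, y)) / len(X)


def exp_rado_risk(X, y, theta, signatures):
    """Exponential rado-risk, eq. (2), over the given signatures."""
    sigs = list(signatures)
    return sum(math.exp(-dot(theta, rado(X, y, s))) for s in sigs) / len(sigs)


def logistic_rado_risk(X, y, theta, signatures):
    """Logistic rado-risk, eq. (4)."""
    return math.log(2.0) + math.log(exp_rado_risk(X, y, theta, signatures)) / len(X)


# Toy sample (made-up numbers): m = 4 examples in dimension d = 2.
X = [[1.0, 0.5], [-0.3, 2.0], [0.7, -1.2], [-1.5, -0.4]]
y = [1, -1, 1, -1]
theta = [0.4, -0.25]

all_signatures = list(itertools.product([-1, 1], repeat=len(X)))  # the 2^m signatures
lhs = logistic_risk(X, y, theta)                            # left-hand side of eq. (3)
rhs = logistic_rado_risk(X, y, theta, all_signatures)       # right-hand side of eq. (3)
print(abs(lhs - rhs) < 1e-9)
```

Passing a random subset of signatures instead of `all_signatures` gives the quantity that Theorem 3 relates to the logistic loss.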


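Theorem 3 Assume $\Theta \subseteq B(0, r_\theta)$, for some $r_\theta > 0$. Let:

$$\varrho = \sup_{\theta' \in \Theta} \max_{\pi_\sigma \in \Sigma_r} \frac{\exp(-\theta'^\top \pi_\sigma)}{F^r_{\exp}(S, \theta', \Sigma_r)} , \qquad \varrho' = \frac{F^r_{\exp}(S, \theta, \Sigma_r)}{F^r_{\exp}(S, \theta, \Sigma_m)} ,$$

where $\Sigma_r$ follows (i) above. Then $\forall \eta > 0$, there is probability $\geq 1 - \eta$ over the sampling of $U$ in (ii) above that:

$$F_{\log}(S, \theta) \leq F^r_{\log}(S, \theta, U) + Q - \frac{1}{m} \cdot \log(1 - q) , \tag{5}$$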


with 


2c71 1 

q = n ( ^ • .\/re max ||7to-||2 + dlog —^ + log - 

Sr CL T| 


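and $Q = -(1/m) \cdot \log \varrho'$ satisfies $Q = 0$ if $\Sigma_r = \Sigma_m$ and

$$Q \leq r_\theta \cdot \left( \|\nabla_\theta F^r_{\log}(S, \theta, \Sigma_m)\|_2 + \pi_r \right) \tag{7}$$

otherwise, letting $\pi_r = \|E_{\sigma \sim \Sigma_r} (1/m) \cdot \pi_\sigma\|_2$. Furthermore, $\forall\, 0 \leq \beta < 1/2$, if $m$ is sufficiently large, then letting $\pi_* = \max_{\Sigma_r} \|(1/m) \cdot \pi_\sigma\|_2$, ineq. (5) becomes: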


-7iog(S)^) < 


Fi;g(s, 0 ,u) + g 



IrgTT* d , 2en 

-1-log 

n nm ap 


( 8 ) 


(Proof in the Appendix, Subsection 10.2) Theorem 3 does not depend on the algorithm that learns $\theta$. The right-hand side of ineq. (5) shows two penalties. $Q$ arises from the choice of $\Sigma_r$ and is therefore structural. Regardless of $\Sigma_r$, when the classifier is reasonably accurate over all rados and the expected example edges in $\Sigma_r$ average to a ball of reduced radius, the upperbound on $Q$ in ineq. (7) can be very small. The other penalty, which depends on $q$, is statistical and comes from the sampling in $\Sigma_r$. Theorem 3 shows that when $\Sigma_r = \Sigma_m$, even when $n \ll m$, the minimization of $F^r_{\log}(S, \theta, U)$ may still bring, with high probability, guarantees on the minimization of $F_{\log}(S, \theta)$. Thus, a lightweight optimization procedure over a small number of rados may bring guarantees on the minimization of the expected logloss over examples for the same classifier. The following Section exhibits one such algorithm.
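
Algorithm 1 Rado boosting (RadoBoost)

Input: set of rados $S^r = \{\pi_1, \pi_2, \ldots, \pi_n\}$; $T \in \mathbb{N}_*$;

Step 1: let $\theta_0 \leftarrow 0$, $w_1 \leftarrow (1/n)\mathbf{1}$;

Step 2: for $t = 1, 2, \ldots, T$

    Step 2.1: $[d] \ni \iota(t) \leftarrow \mathrm{WFI}(S^r, w_t)$;

    Step 2.2: let

    $$r_t \leftarrow \frac{1}{\pi_{*\iota(t)}} \sum_{j=1}^{n} w_{tj} \pi_{j\iota(t)} ; \tag{9}$$

    $$\alpha_t \leftarrow \frac{1}{2\pi_{*\iota(t)}} \log \frac{1 + r_t}{1 - r_t} ; \tag{10}$$

    Step 2.3: for $j = 1, 2, \ldots, n$

    $$w_{(t+1)j} \leftarrow w_{tj} \cdot \frac{1 - \frac{r_t \pi_{j\iota(t)}}{\pi_{*\iota(t)}}}{1 - r_t^2} ; \tag{11}$$

Return $\theta_T$ defined by $\theta_{Tk} = \sum_{t : \iota(t) = k} \alpha_t$, $\forall k \in [d]$;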


3 Boosting using rados 

Algorithm 1 provides a boosting algorithm, RadoBoost, that learns from a set of Rademacher observations $S^r = \{\pi_1, \pi_2, \ldots, \pi_n\}$. Their (unknown) Rademacher assignments are denoted $U = \{\sigma_1, \sigma_2, \ldots, \sigma_n\} \subseteq \Sigma_m$. These rados have been computed from some sample $S$, unknown to RadoBoost. In the statement of the algorithm, $\pi_{jk}$ denotes coordinate $k$ of $\pi_j$, and $\pi_{*k} = \max_j |\pi_{jk}|$. More generally, the coordinates of some vector $z \in \mathbb{R}^d$ are denoted $z_1, z_2, \ldots, z_d$. Step 2.1 gets a feature index $\iota(t)$ from a weak feature index oracle, WFI. In its general form, WFI returns a feature index maximizing $|r_t|$ in (9). The weight update in (11) was preferred to AdaBoost’s because rados can have large feature values and this update prevents numerical precision errors that could otherwise occur using AdaBoost’s exponential weight update. We now prove a key Lemma on RadoBoost, namely the fast convergence of the exponential rado-risk $F^r_{\exp}(S, \theta, U)$ under a weak learning assumption (WLA). We shall then obtain the convergence of the logistic rado-risk (4), and, via Theorem 3, the convergence with high probability of $F_{\log}(S, \theta)$.
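The steps of RadoBoost can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: WFI is taken in its general form (the feature maximizing $|r_t|$), the early stop when $|r_t|$ reaches 1 and the skipping of all-zero features are our own guards, and the toy rados are built from made-up data. The final check compares the exponential rado-risk with $\exp(-\sum_t r_t^2/2)$, the per-round form of the bound that the weak learning assumption turns into a rate.

```python
import math
import random


def radoboost(rados, T):
    """Sketch of Algorithm 1; returns (theta, edges r_t, final weights)."""
    n, d = len(rados), len(rados[0])
    # pi_star[k] = max_j |pi_jk|; features that are zero on every rado are skipped.
    pi_star = [max(abs(p[k]) for p in rados) for k in range(d)]
    usable = [k for k in range(d) if pi_star[k] > 0.0]
    theta, w, edges = [0.0] * d, [1.0 / n] * n, []
    for _ in range(T):
        if not usable:
            break
        # Step 2.1 (WFI, general form): feature maximizing |r_t|, with r_t as in eq. (9).
        r_all = {k: sum(w[j] * rados[j][k] for j in range(n)) / pi_star[k] for k in usable}
        k = max(usable, key=lambda f: abs(r_all[f]))
        r = r_all[k]
        if abs(r) >= 1.0 - 1e-12:
            break  # alpha_t would diverge; stopping here is our choice, not the paper's
        # Step 2.2: leveraging coefficient, eq. (10).
        theta[k] += math.log((1.0 + r) / (1.0 - r)) / (2.0 * pi_star[k])
        # Step 2.3: non-exponential weight update, eq. (11); weights keep summing to 1.
        w = [w[j] * (1.0 - r * rados[j][k] / pi_star[k]) / (1.0 - r * r) for j in range(n)]
        edges.append(r)
    return theta, edges, w


def exp_rado_risk(rados, theta):
    """Exponential rado-risk, eq. (2), over the given rados."""
    return sum(math.exp(-sum(t * p for t, p in zip(theta, pi))) for pi in rados) / len(rados)


# Toy data (assumption): labels from a hidden linear separator, rados with i.i.d. signatures.
rng = random.Random(0)
m, d, n = 30, 3, 20
hidden = [1.0, -2.0, 0.5]
X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
y = [1 if sum(h * v for h, v in zip(hidden, x)) >= 0 else -1 for x in X]
edge_vectors = [[y_i * v for v in x] for x, y_i in zip(X, y)]
rados = []
for _ in range(n):
    keep = [rng.random() < 0.5 for _ in range(m)]  # sigma_i == y_i with probability 1/2
    rados.append([sum(e[k] for e, b in zip(edge_vectors, keep) if b) for k in range(d)])

theta, edges, w = radoboost(rados, T=50)
risk = exp_rado_risk(rados, theta)
bound = math.exp(-sum(r * r for r in edges) / 2.0)
print(risk <= bound + 1e-9)
```

Note that the learner only ever touches `rados`; the examples are used here solely to manufacture the toy input.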

(WLA) $\exists \gamma > 0$ such that $\forall t \geq 1$, the feature returned by WFI in Step 2.1 satisfies $|r_t| \geq \gamma$, with $r_t$ computed as in (9).

Lemma 4 Suppose the (WLA) holds. Then after $T$ rounds of boosting in RadoBoost, the following upperbound holds on the exponential rado-loss of $\theta_T$:

$$F^r_{\exp}(S, \theta_T, U) \leq \exp(-T\gamma^2/2) . \tag{12}$$

(Proof in the Appendix, Subsection 10.3) We now consider Theorem 3 with $\Sigma_r = \Sigma_m$ and therefore $Q = 0$. Blending Lemma 4 and Theorem 3 using (4) yields that, under the (WLA), we may observe with high probability:

$$F_{\log}(S, \theta_T) \leq \log(2) - \frac{T\gamma^2}{2m} + Q' , \tag{13}$$

Domain       m           d   100σ  AdaBoost     AdaBoost(n)  n/m     RadoBoost    n/2^m        p     p'
                                   err±σ        err±σ                err±σ
Fertility    100         9   -     47.00±18.99  44.00±16.47  0.50    53.00±14.94  [8:-28]      0.23  0.09
Haberman     306         3   -     25.72±10.62  33.01±9.58   0.50    26.08±9.94   [8:-90]      0.70  0.02
Transfusion  748         4   -     39.42±6.13   37.83±4.94   0.50    39.29±5.76   [7:-223]     0.81  0.36
Banknote     1 372       4   -     2.77±1.28    2.63±1.34    0.50    14.21±3.22   [9:-411]     ε     ε
Breast wisc  699         9   -     3.00±1.42    3.43±2.25    0.50    4.86±2.35    [4:-208]     0.03  0.13
Ionosphere   351         33  -     11.69±5.31   11.70±4.77   0.50    15.40±9.93   [2:-103]     0.13  0.09
Sonar        208         60  -     26.88±9.36   25.43±6.61   0.50    28.36±8.84   [2:-60]      0.76  0.42
Wine-red*    1 599       11  1     26.14±3.10   26.39±3.15   0.50    28.02±2.90   [4:-479]     0.05  0.03
Abalone*     4 177       8   -     22.96±1.44   23.20±1.44   0.24    25.14±1.83   [3:-[1:3]]   ε     ε
Wine-white*  4 898       11  1     30.93±3.42   30.44±3.25   0.20    32.48±3.55   [3:-[1:3]]   ε     ε
Magic*       19 020      10  -     21.07±0.98   20.91±0.99   0.05    22.75±1.51   [3:-[5:3]]   ε     0.01
EEG          14 980      14  14    46.04±1.38   44.36±1.99   0.07    44.23±1.73   [4:-[4:3]]   ε     0.86
Hardware*    28 179      95  -     16.82±0.72   16.76±0.73   0.04    7.61±3.24    [2:-[8:3]]   ε     ε
Twitter*     583 250     77  44    53.75±1.48   53.09±11.23  [1:-3]  6.00±0.77    [1:-[1:5]]   ε     ε
SuSy         5 000 000   17  -     27.76±0.14   27.43±0.19   [2:-4]  27.26±0.55   [1:-[1:6]]   0.02  0.39
Higgs        11 000 000  28  -     42.55±0.19   45.39±0.28   [9:-5]  47.86±0.06   [1:-[1:7]]   ε     ε

Table 1: Comparison of RadoBoost (n random rados), AdaBoost [27] (full training fold) and AdaBoost(n) (n random examples in training fold); domains ranked in increasing $d \cdot m$ value. Column “n/m” (resp. “n/2^m”) for AdaBoost(n) (resp. RadoBoost) is the proportion of training data with respect to fold size (resp. full set of rados). Notation [a:b] is shorthand for $a \times 10^b$. Column “100σ” is the number of features with outlier values distant from the mean by more than 100σ in absolute value. Column p (resp. p') is the p-value for a two-tailed paired t-test on AdaBoost (resp. AdaBoost(n)) vs RadoBoost. ε means < 0.01.


where $Q'$ is the rightmost term in ineq. (5) or ineq. (8). So provided $n \ll 2^m$ is sufficiently large, minimizing the exponential rado-risk over a subset of rados brings a classifier whose average logloss on the whole set of examples may decrease at rate $\Omega(\gamma^2/m)$ under a weak learning assumption made over rados only. This rate competes with those for direct approaches to boosting the logloss [23], and we now show that our weak learning assumption is also essentially equivalent to the one done in boosting over examples. Let us rewrite $r_t(w)$ as the normalized edge in (9), making explicit the dependence in the current rado weights. Let


$$r^{ex}_t(w) = \frac{1}{x_{*\iota(t)}} \sum_{i=1}^{m} w_i y_i x_{i\iota(t)} \tag{14}$$

be the normalized edges for the same feature $\iota(t)$ as the one picked in Step 2.1 of RadoBoost, but computed over examples using some weight vector $w \in P^m$; here, $P^m$ is the $m$-dim probability simplex and $x_{*k} = \max_i |x_{ik}|$.

Lemma 5 $\forall w_t \in P^n$, $\forall \gamma > 0$, there exist $w \in P^m$ and $\gamma^{ex} > 0$ such that $|r_t(w_t)| \geq \gamma$ iff $|r^{ex}_t(w)| \geq \gamma^{ex}$.


(Proof in the Appendix, Subsection 10.4) The proof of the Lemma gives clues to explain why the presence of outlier feature values may favor RadoBoost.


4 Basic experiments with RadoBoost 

We have compared RadoBoost to its main contender, AdaBoost [27], using the same weak learner; in AdaBoost, it returns a feature maximizing $|r_t|$ as in eq. (14). In these basic experiments, we have deliberately not optimized the set of rados in which we sample $U$ for RadoBoost; hence, we have $\Sigma_r = \Sigma_m$.

Figure 1: Summary of the DP-related contributions of Section 5 (in color). (a): usual DP mechanism that protects examples (S) prior to delivery to learner (L); (b): mechanism that crafts differentially private rados (R) from unprotected examples (Subsection 5.1); (c): mechanism crafting rados from DP-compliant examples with the objective to improve performances of the rado-based learner L’ (Subsection 5.2).

We have performed comparisons with 10-fold stratified cross-validation (CV) on 16 domains of the UCI repository [2] of varying size. For space considerations, Table 1 presents the results. Each algorithm was run for a total number of $T = 1000$ iterations; furthermore, the classifier kept for testing is the one minimizing the empirical risk throughout the $T$ iterations; in doing so, we also assessed the early convergence of algorithms. We fixed $n = \min\{1000, \text{train fold size}/2\}$. Table 1 displays that RadoBoost compares favourably to AdaBoost, and furthermore it tends to be all the better as $m$ and $d$ increase. On some domains like Hardware and Twitter, the difference is impressive and clearly in favor of RadoBoost. As discussed for Lemma 5, we could interpret these comparatively very poor performances of AdaBoost as the consequence of outlier features that can trick AdaBoost into picking the wrong sign in the leveraging coefficient $\alpha_t$ for a large number of iterations if we use real-valued classifiers (see column 100σ in Table 1). This drawback can be easily corrected (cf. Appendix, Subsection 11.1) by enforcing minimal $|r_t|$ values. This significantly improves AdaBoost on Hardware and Twitter. The improvements observed on RadoBoost are even more favorable.


5 Rados and differential privacy 

We now discuss the delivery of rados to comply with several DP constraints and their eventual impact on boosting. We thus address both levels (i-ii) of rado delivery in Section 2. Our general model is the standard DP model [12]. Intuitively, an algorithm is DP compliant if for any two neighboring datasets, it assigns similar probability to any possible output $O$. In other words, any particular record has only limited influence on the probability of any given output of the algorithm, and therefore the output discloses very little information about any particular record in the input. Formally, a randomized algorithm $A$ is $(\epsilon, \delta)$-differentially-private [H] for some $\epsilon, \delta > 0$ iff:

$$P_A[O|S] \leq \exp(\epsilon) \cdot P_A[O|S'] + \delta , \quad \forall S \approx S', O , \tag{15}$$

where the probability is over the coin tosses of $A$. This model is very strong, especially when $\delta = 0$, and in the context of ML, maintaining high accuracy in strong DP regimes is generally

Algorithm 2 Feature-wise DP-compliant rados (DP-Feat)

Input: set of examples $S$, sensitive feature $j_* \in [d]$, number of rados $n$, differential privacy parameter $\epsilon > 0$;

Step 1: let $\beta \leftarrow 1/(1 + \exp(\epsilon/2)) \in [0, 1/2)$;

Step 2: sample $\sigma_1, \sigma_2, \ldots, \sigma_n$ i.i.d. (uniform) in the subset of $\Sigma_m$ defined in (16);

Return set of rados $\{\pi_\sigma : \sigma$ sampled in Step 2$\}$;



Figure 2: How DP-Feat works; neighbor samples $S$ and $S'$ differ by one value for feature $j_*$ (i.e. one edge coordinate, represented); the rado whose support relies only on the “-1” in $S$ (dashed lines) yields an infinite ratio $P_A[O|S]/P_A[O|S']$ in (15). This rado would never be sampled by DP-Feat. On the other hand, a rado that sums an equal number $s$ of “+1” and “-1” (dotted lines) may yield a ratio very close to 1 (such a rado can be sampled by DP-Feat).


a tricky tradeoff. Because rados are an intermediate step between training sample $S$ and a rado-based learner, there are two ways to design rados with respect to the DP framework: crafting DP-compliant rados from unprotected examples, or crafting rados from DP-compliant examples with the aim to improve the performance of the rado-based learner (Figure 1). These scenarios can be reduced to the design of $\Sigma_r$.


5.1 A feature-wise DP mechanism for rados 

In this Subsection, we consider a relaxation of differential privacy, namely feature-wise differential privacy, where the differential privacy requirement applies to $j_*$-neighboring datasets: we say that two samples $S, S'$ are $j_*$-neighbors, noted $S \approx_{j_*} S'$, if they are the same except for the value of the $j_*$-th ($j_* \in [d]$) observation feature of some example. We further assume that the feature is boolean. For example, we may have a medical database containing a column representing the HIV status of a doctor’s patients (1 row = a patient), and we do not wish that changing a single patient’s HIV status significantly changes the density of that feature’s values in rados. This setting would also be very useful in genetic applications to hide in rados gene disorders that affect one or few genes. Feature-wise DP is analogous to the concept of $\alpha$-label privacy [7], where differential privacy is guaranteed with respect to the label. Algorithm $A$ in ineq. (15) is given in Algorithm 2. It relies on the following subset of $\Sigma_m$:


yhd* — 






7Tl 1 'I 

\{i : ViXij, = -hl}| - — ± I , 


(16) 






















with $\Delta_\beta = (m/2) - \beta(m + 1)$. The key feature of this mechanism is that it does not alter the examples in the sense that DP-compliant rados belong to the set of cardinal $2^m$ that can be generated from $S$. Usual data-centered DP mechanisms would rather alter data, e.g. via noise injection. Algorithm 2 exploits the fact that it is the tails of feature $j_*$ that leak sensitive information about the feature in rados (see Figure 2). The following Theorem is stated so that we can pick small $\delta$, typically $\delta \ll 1/m$. Other variants are possible that bring different tradeoffs between $\epsilon$ and $\delta$.


Theorem 6 Assume e is chosen so that e = o(l) but e = D(l/m). In this case, DP-Feat main¬ 
tains (n ■ e,n ■ 8)-differential privacy on feature j* for some 5 > 0 such that e ■ 8 = 0(m“®/^). 


(Proof in the Appendix, Subsection 10.5) We have implemented Step 2 in Algorithm DP-Feat in the simplest way, using a simple Rademacher rejection sampling where each $\sigma_j$ is picked i.i.d. as $\sigma_j \sim \Sigma_m$ until $\sigma_j$ belongs to the subset defined in (16). The following Theorem shows its algorithmic efficiency.
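The rejection sampling described above can be sketched as follows. This is a minimal illustration under assumptions: the function names are ours, and the acceptance test `in_band` is only a stand-in that rejects signatures whose support size lies in the tails; it is not the membership test of the subset in (16), which should be plugged in as the `accept` argument.

```python
import math
import random


def dp_feat_beta(epsilon):
    """Step 1 of Algorithm 2: beta = 1 / (1 + exp(epsilon / 2))."""
    return 1.0 / (1.0 + math.exp(epsilon / 2.0))


def rejection_sample_signatures(m, n, accept, rng, max_draws=1_000_000):
    """Draw uniform signatures in {-1, 1}^m until n of them pass `accept`.

    Returns the accepted signatures and the total number of draws."""
    accepted, draws = [], 0
    while len(accepted) < n:
        if draws >= max_draws:
            raise RuntimeError("too many rejections")
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        draws += 1
        if accept(sigma):
            accepted.append(sigma)
    return accepted, draws


rng = random.Random(1)
m = 12
y = [rng.choice((-1, 1)) for _ in range(m)]  # toy labels (assumption)
beta = dp_feat_beta(epsilon=1.0)


def in_band(sigma):
    # Stand-in acceptance test (assumption, NOT eq. (16)): keep signatures whose
    # support size |{i : sigma_i == y_i}| stays away from both tails.
    support = sum(1 for s_i, y_i in zip(sigma, y) if s_i == y_i)
    return beta * m <= support <= (1.0 - beta) * m


signatures, total_draws = rejection_sample_signatures(m, n=5, accept=in_band, rng=rng)
print(len(signatures), total_draws >= 5)
```

The second returned value plays the role of the total number of sampled rados whose size the next Theorem bounds.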


Theorem 7 For any $\eta > 0$, let $n_* = \eta(1 - \exp(2\beta - 1))/(4\beta)$, and let $n_R$ denote the total number of rados sampled in $\Sigma_m$ until $n$ rados are found in the subset defined in (16). Then for any $\eta > 0$, there is probability $\geq 1 - \eta$ that


nR < 



1 


_ 1 _ 

mDsB(l-/3||l/2) 


log 


n 



if n < n* 
otherwise 


where $D_{BE}$ is the bit-entropy divergence: $D_{BE}(p\|q) = p \log(p/q) + (1 - p) \log((1 - p)/(1 - q))$, for $p, q \in (0, 1)$.


(Proof in the Appendix, Subsection 10.6) Remark that replacing $\Sigma_m$ by $\Sigma_r$, set to the subset defined in (16), would not necessarily impair the boosting convergence of RadoBoost trained from rados sampled from DP-Feat (Lemma 4). The only systematic change would be in ineq. (13) where we would have to integrate the structural penalty $Q$ from Theorem 3 to further upperbound $F_{\log}(S, \theta_T)$. In this case, the upperbound in (7) reveals that at least when the mean operator in $\Sigma_r$ has small norm (which may be the case even when some examples in $S$ have large norm) and the gradient penalty is small, then $Q$ may be small as well.

We end up with several important remarks, whose formal statements and proofs are left out 
due to space constraints. First, the tail truncation design exploited in DP-Feat can be fairly 
simply generalized in two directions, to handle (a) real-valued features, and/or (b) several sensitive 
features instead of one. Second, we can do DP-compliant design of rado delivery beyond
feature-wise privacy, e.g. to protect “rado-wide” quantities like norms.


5.2 Boosting from DP-compliant examples via rados 

We now show how to craft rados from DP-compliant examples so as to approximately keep the 
convergence rates of RadoBoost. More precisely, since edge vectors are sufficient to learn (eq.
(1)), we assume that edge vectors are DP-compliant (neighbor samples, $S \approx S'$, would differ on one
edge vector). A gold standard to protect data in the DP framework is to convolute data with 
noise. One popular mechanism is the Gaussian mechanism [laiiB], which convolutes data with 
independent Gaussian random variables $N(0, \varsigma^2 I)$, whose standard deviation $\varsigma$ depends on the DP
requirement $(\epsilon, \delta)$. Strong DP regimes are tricky to handle for learning algorithms. For example,
the approximation factor $\rho$ of the singular vectors under DP noise of the noisy power method
roughly behaves as $\rho = \Omega(\varsigma/\Delta)$ [16] (Corollary 1.1) where $\Delta = O(d)$ is a difference between two

Table 2: Left table: RadoBoost on feature-wise DP compliant rados (Subsection 5.1, showing standard deviations) vs RadoBoost on plain random rados baseline and AdaBoost baseline (trained with complete fold). Center: test error of RadoBoost minus AdaBoost’s (also showing AdaBoost error on right axis, dotted line), for rados with fixed support $s$ ($= m_*$, in green, red, blue) and plain random rados (dotted grey). Right: test error of RadoBoost using fixed support $s$ rados and a prudential learner, minus RadoBoost using plain random rados and the “strong” learner of Section 4 (see also the Appendix, through Table 11).


singular values. When $\varsigma$ is small, this is a very good bound. When the DP requirement blows up,
the bound remains relevant if $d$ increases, which may be hard to achieve in practice: it is easier
in general to increase $m$ than $d$, which requires computing new features for past examples.

We consider ineq. (15) with neighbors $I$ and $I'$ being two sets of $m$ edge vectors differing by one edge vector, and $O$ is a noisified set of $m$ edge vectors generated through the Gaussian mechanism [12] (Appendix A). We show the following non-trivial result: provided we design another particular $\Sigma_r$, the convergence rate of RadoBoost, as measured over non-noisy rados, essentially survives noise injection in the edge vectors through the Gaussian mechanism, even under strong noise regimes, as long as $m$ is large enough. The intuition is straightforward: we build rados summing a large number of edge vectors only (this is the design of $\Sigma_r$), so that the i.i.d. noise component gets sufficiently concentrated for the algorithm to be able to learn almost as fast as in the noise-free setting. We emphasize the non-trivial fact that the convergence rate is measured over the non-noisy rados, which of course RadoBoost does not see. The result is of independent interest in the boosting framework, since it makes use of a particular weak learner (WFI), which we call prudential, which picks features with $|r_t|$ (9) upperbounded.

We start by renormalizing coefficients $\alpha_t$ (eq. (10)) in RadoBoost by a parameter $\kappa \geq 1$ given as input, so that we now have $\alpha_t \leftarrow (1/(2\kappa\pi_{*\iota(t)})) \log((1 + r_t)/(1 - r_t))$ in Step 2.2. It is not hard to check that the convergence rate of RadoBoost now becomes, prior to applying the (WLA):


$$F^r_{\log}(S, \theta_T, U) \leq \log(2) - \frac{1}{2\kappa m} \sum_t r_t^2 . \tag{17}$$

We say that WFI is $\lambda_p$-prudential for $\lambda_p > 0$ iff it selects at each iteration a feature such that $|r_t| \leq \lambda_p$. Edge vectors have been DP-protected as $y_i(x_i + x^r_i)$, with $x^r_i \sim N(0, \varsigma^2 I)$ (for $i \in [m]$). Let $m_\sigma = |\{i : \sigma_i = y_i\}|$ denote the support of a rado, and ($m_* > 0$ fixed):

$$\Sigma_r = \Sigma^{m_*}_m = \{\sigma \in \Sigma_m : m_\sigma = m_*\} . \tag{18}$$
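The design in (18) can be sketched in Python as follows. This is an illustration under assumptions: the helper names and the toy data are ours, and the noise model is the Gaussian one stated above. A signature with support exactly $m_*$ is drawn by picking uniformly the $m_*$ coordinates where $\sigma_i = y_i$; each coordinate of the resulting noisy rado then carries a sum of $m_*$ independent Gaussian terms, hence noise of standard deviation $\varsigma\sqrt{m_*}$.

```python
import random


def fixed_support_signature(y, m_star, rng):
    """Uniform signature with exactly m_star coordinates such that sigma_i == y_i."""
    agree = set(rng.sample(range(len(y)), m_star))
    return [y[i] if i in agree else -y[i] for i in range(len(y))]


def noisy_edges(X, y, std, rng):
    """Edge vectors protected as y_i * (x_i + noise), with noise ~ N(0, std^2 I)."""
    return [[y_i * (v + rng.gauss(0.0, std)) for v in x_i] for x_i, y_i in zip(X, y)]


def rado_from_edges(edges, y, sigma):
    """Sum the edge vectors of the examples for which sigma_i == y_i."""
    d = len(edges[0])
    return [sum(e[k] for e, y_i, s_i in zip(edges, y, sigma) if s_i == y_i)
            for k in range(d)]


# Toy data (assumption): m = 50 examples in dimension 2, support m_* = 40.
rng = random.Random(2)
X = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(50)]
y = [rng.choice((-1, 1)) for _ in range(50)]
sigma = fixed_support_signature(y, m_star=40, rng=rng)

clean = rado_from_edges(noisy_edges(X, y, 0.0, rng), y, sigma)   # noise-free rado
noisy = rado_from_edges(noisy_edges(X, y, 5.0, rng), y, sigma)   # rado seen by the learner
print(clean, noisy)
```

The learner would only receive rados such as `noisy`; `clean` is the quantity over which the convergence rate is measured.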


Theorem 8 $\forall U \subseteq \Sigma_r$, $\forall \tau > 0$, if $\sqrt{m_*} = \Omega(\varsigma \ln(1/\tau))$, then $\exists \lambda_p > 0$ such that RadoBoost having access to a $\lambda_p$-prudential weak learner returns after $T$ iterations a classifier $\theta_T$ which meets with probability $\geq 1 - \tau$:


F,;,(S.0r,U) < log(2)-—^ 


(19)


The proof, in the Appendix (Subsection 10.7), details parameters and dependencies hidden in the statement. The use of a prudential weak learner is rather intuitive in a noisy setting since $\alpha_t$ blows up when $|r_t|$ is close to 1. Theorem 8 essentially yields that a sufficiently large support for rados is enough to keep with high probability the convergence rate of RadoBoost within the noise-free regime. Of course, the weak learner is prudential, which implies bounded $|r_t| < 1$, and furthermore the leveraging coefficients $\alpha_t$ are normalized, which implies smaller margins. Still, Theorem 8 is a good theoretical argument to rely on rados when learning from DP-compliant edge vectors.


6 Experiments on differential privacy 

Table 2 presents a subset of the experiments carried out with RadoBoost and AdaBoost in the contexts of Subsections 5.1 and 5.2 (see Section 11 for all additional experiments). Unless otherwise stated, experimental settings (cross validation, number of rados for learning, etc.) are the same as in Section 4.

In a first set of experiments, we have assessed the impact on learning of the feature-wise DP 
mechanism: on each tested domain, we have selected at random a binary feature, and then used 
Algorithm DP-Feat to protect the feature for different values of DP parameter $\epsilon$, in a range that
covers usual DP experiments [18] (Table 1). The main conclusion that can be drawn from the
experiments is that learning from DP-compliant rados can compete with learning from random 
rados, and even learning from examples (AdaBoost), even for rather small e. 

We then have assessed the impact on learning of examples that have been protected using the 
Gaussian mechanism [12], with or without rados, with or without a prudential weak learner for 
boosting, and with or without using a fixed support for rado computation. The Appendix provides 
extensive results for all domains but the largest ones (Twitter, SuSy, Higgs). In the central column 
(and the corresponding tables in the Appendix), computing the differences between RadoBoost’s error
and AdaBoost’s reveals that, on domains where it is beaten by AdaBoost when there is no noise,
RadoBoost almost always rapidly becomes competitive with AdaBoost as noise increases. Hence,
RadoBoost is a good contender from the boosting family to learn from differentially private (or
noisy) data. Second, using a prudential weak learner which picks the median feature (instead of the
more efficient weak learner that picks the best as in Section 4) can have RadoBoost with fixed
support rados compete or beat RadoBoost with plain random rados, at least for small noise levels
(see Transfusion and Magic in the right column of Table 2). Replacing the median-prudential weak
learner by a strong learner can actually degrade RadoBoost’s results (see the Appendix, Tables
10 and 11). These two observations advocate in favor of the theory developed in Subsection 5.2.
Finally, using rados with fixed support instead of plain random rados (Section 4) can significantly
improve the performances of RadoBoost (see the Appendix, Tables 10 and 11).


7 From rados to examples: hardness results 

The problem we address here is when we can recover examples from rados, and when we cannot.
This last setting is particularly useful from the privacy standpoint,
as it may save us costly obfuscation techniques that impede ML tasks [1].


7.1 Algebraic and geometric hardness 

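For any $m \in \mathbb{N}_*$, we define matrix $G_m \in \{0,1\}^{m \times 2^m}$ as:

$$G_m = \begin{bmatrix} 0^\top_{2^{m-1}} & 1^\top_{2^{m-1}} \\ G_{m-1} & G_{m-1} \end{bmatrix} \tag{20}$$

if $m > 1$, and $G_1 = [0 \;\; 1]$ otherwise ($z_d$ denotes the vector in $\mathbb{R}^d$ whose coordinates all equal $z$). Each column of $G_m$ is the binary
indicator vector for the edge vectors considered in a rado. Hereafter, we let $E \in \mathbb{R}^{d \times m}$ be the matrix
of columnwise edge vectors from $\mathcal{S}$, $\Pi \in \mathbb{R}^{d \times n}$ the columnwise rado matrix and $U \in \{0,1\}^{2^m \times n}$ the matrix in
which each column gives the index of a rado. By construction, we have:

$$\Pi = E G_m U , \tag{21}$$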


and so we have the following elementary results for the (non) reconstruction of E (proof omitted). 


Lemma 9 (a) when recoverable, edge-vectors satisfy: $E = \Pi U^\top G_m^\top (G_m U U^\top G_m^\top)^{-1}$; (b) when $U$,
$\Pi$, $m$ are known but $n < m$, there is not a single solution to eq. (21) in general.


Lemma 9 states that even when $U$, $\Pi$ and $m$ are known, elementary constraints on rados can make
the recovery of edge vectors hard; notice that such constraints are met in our experiments with
RadoBoost.
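As an illustration of eq. (21) and Lemma 9, the following sketch builds $G_m$ recursively, forms a rado matrix from a toy edge-vector matrix, and applies the formula of Lemma 9 (a). The helper names (`build_G`, `one_hot_columns`), the random edge vectors and the chosen rado indices are assumptions made for this example only; they do not come from the paper.

```python
import numpy as np

def build_G(m):
    """Recursive construction of eq. (20): G_m is an m x 2^m binary matrix."""
    if m == 1:
        return np.array([[0, 1]])
    G_prev = build_G(m - 1)
    half = G_prev.shape[1]  # 2^(m-1) columns
    top = np.concatenate([np.zeros(half, dtype=int), np.ones(half, dtype=int)])
    return np.vstack([top, np.hstack([G_prev, G_prev])])

def one_hot_columns(indices, size):
    """U in {0,1}^(size x n): column j selects column indices[j] of G_m."""
    U = np.zeros((size, len(indices)), dtype=int)
    U[indices, np.arange(len(indices))] = 1
    return U

rng = np.random.default_rng(0)
d, m = 2, 3
E = rng.normal(size=(d, m))      # hypothetical edge vectors, one per column
G = build_G(m)

# Case n >= m with G_m U of full row rank: E is recoverable (Lemma 9 (a)).
U = one_hot_columns([1, 2, 4, 7], 2 ** m)
Pi = E @ G @ U                   # eq. (21)
A = G @ U
E_rec = Pi @ A.T @ np.linalg.inv(A @ A.T)

# Case n < m: G_m U has rank < m, so eq. (21) admits several solutions (Lemma 9 (b)).
A_small = G @ one_hot_columns([3, 5], 2 ** m)
rank_small = np.linalg.matrix_rank(A_small)
```

In this toy setting `E_rec` matches `E` up to numerical error, whereas `rank_small` is smaller than `m`, which is the situation covered by part (b).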

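But this represents a lot of unnecessary knowledge to learn from rados: RadoBoost just needs
$\Pi$ to learn. We now explore the guarantees that providing this sole information brings in terms
of (not) reconstructing $E$. For any matrix $M$, we let $\mathcal{C}(M)$ denote the set of its column vectors, and for any
$\mathcal{C} \subseteq \mathbb{R}^d$ we let $\mathcal{C} \oplus \epsilon = \cup_{z \in \mathcal{C}} \mathcal{B}(z, \epsilon)$. We define the Hausdorff distance, $D_H(E, E')$, between $E$ and
$E'$:

$$D_H(E,E') = \inf\{\epsilon : \mathcal{C}(E) \subseteq \mathcal{C}(E') \oplus \epsilon \;\wedge\; \mathcal{C}(E') \subseteq \mathcal{C}(E) \oplus \epsilon\} .$$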
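To make the definition concrete, here is a minimal sketch that computes the Hausdorff distance between the column sets of two matrices. It assumes Euclidean balls; the function name `hausdorff` and the two toy matrices are illustrative choices, not taken from the paper.

```python
import numpy as np

def hausdorff(E, E_prime):
    """Hausdorff distance between the column sets C(E) and C(E')."""
    # dist[i, j] = Euclidean distance between column i of E and column j of E'.
    diff = E[:, :, None] - E_prime[:, None, :]
    dist = np.linalg.norm(diff, axis=0)
    # Smallest eps such that each set lies in the eps-enlargement of the other.
    return max(dist.min(axis=1).max(), dist.min(axis=0).max())

E = np.array([[0.0, 1.0],
              [0.0, 0.0]])              # columns (0,0) and (1,0)
E_prime = np.array([[0.0, 1.0, 1.0],
                    [0.0, 0.0, 2.0]])   # same columns plus (1,2)
distance = hausdorff(E, E_prime)
```

Here the extra column (1,2) of `E_prime` is the only point that is not matched exactly, and its distance to the closest column of `E` determines the result.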

The following Lemma shows that if the only information known is $\Pi$, then there exist samples that
bring the same set of rados $\mathcal{C}(\Pi)$ as the unknown $E$ but that are at a distance proportional to the
"width" of the domain at hand.

Lemma 10 For any $\Pi \in \mathbb{R}^{d \times n}$, suppose eq. (21) holds, for some unknowns $m > 0$, $E \in \mathbb{R}^{d \times m}$,
$U \in \{0,1\}^{2^m \times n}$. Suppose $\mathcal{C}(E) \subseteq \mathcal{B}(0, R)$ for some $R > 0$. Then there exist $E' \in \mathbb{R}^{d \times (m+1)}$,
$U' \in \{0,1\}^{2^{m+1} \times n}$ such that

$$\mathcal{C}(E') \subseteq \mathcal{B}(0,R) \quad \text{and} \quad \Pi = E' G_{m+1} U' , \tag{22}$$

but

$$D_H(E,E') = \Omega\left(\frac{R \log d}{\sqrt{d} \log m}\right) \tag{23}$$

if $m \geq 2^d$, and $D_H(E, E') = \Omega(R/\sqrt{d})$ otherwise.


(Proof in the Appendix, Subsection 10.8) Hence, without any more knowledge, leaks, approximations
or assumptions on the domain at hand, the recovery of $E$ pays in the worst case a price
proportional to the radius of the smallest enclosing $\mathcal{B}(0, \cdot)$ ball for the unknown set of examples.
We emphasize that this inapproximability result does not rely on the computational power at hand.


7.2 Computational hardness 

In this Subsection, we investigate two important problems in the recovery of examples. The first
problem addresses whether we can approximately recover sparse examples from a given set of
rados, that is, roughly, solve (21) with a sparsity constraint on examples. The first Lemma we
give is related to the hardness of solving underdetermined linear systems for sparse solutions [9].
The sparsity constraint can be embedded in the compressed sensing framework [8] to yield finer
hardness and approximability results, which is beyond the scope of our paper. We define problem
"Sparse-Approximation" as:


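(Instance) : set of rados $\mathcal{S}^r = \{\pi_1, \pi_2, ..., \pi_n\}$, $m \in \mathbb{N}_*$, $r, \ell \in \mathbb{R}_+$, $\|.\|_p$, $L_p$-norm for $p \in \mathbb{R}_+$;

(Question) : Does there exist set $\mathcal{S} = \{(x_i, y_i), i \in [m]\}$ and set $\mathcal{U} = \{\sigma_1, \sigma_2, ..., \sigma_n\} \subseteq \{-1,1\}^m$
such that:

$$\|x_i\|_p \leq \ell , \forall i \in [m] , \quad \text{(Sparse examples)}$$
$$\|\pi_j - \pi_{\sigma_j}\|_p \leq r , \forall j \in [n] . \quad \text{(Rado approximation)}$$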


Lemma 11 Sparse-Approximation is NP-Hard. 


(Proof in the Appendix, Subsection 10.9) In the context of rados, the second problem we address
has very large privacy applications. Suppose entity A has a huge database of people (e.g. clients),
and obtains a set of rados emitted by another entity B. An important question that A may ask
is whether the rados observed can be approximately constructed from its database, for example to
figure out which of its clients are also its competitors'. We define this as problem
"Probe-Sample-Subsumption":


(Instance) : set of examples $\mathcal{S}$, set of rados $\mathcal{S}^r = \{\pi_1, \pi_2, ..., \pi_n\}$, $m \in \mathbb{N}_*$, $p, r \in \mathbb{R}_+$.

(Question) : Does there exist $\mathcal{S}' = \{(x_i, y_i), i \in [m]\} \subseteq \mathcal{S}$ and set $\mathcal{U} = \{\sigma_1, \sigma_2, ..., \sigma_n\} \subseteq \{-1,1\}^m$
such that:

$$\|\pi_j - \pi_{\sigma_j}\|_p \leq r , \forall j \in [n] . \quad \text{(Rado approximation)}$$

Lemma 12 Probe-Sample-Subsumption is NP-Hard.


(Proof in the Appendix, Subsection 10.10) This worst-case result calls for interesting domain-specific 
qualifications, such as in genetics where the privacy of raw data, i.e. individual genomes, can be 
compromised by genome-wise statistics [ElEI]. 


8 Conclusion 

We have introduced novel quantities that are sufficient for efficient learning, Rademacher observations.
The fact that a subset of these can replace traditional examples for efficient learning opens
interesting problems on how to craft these subsets to cope with additional constraints. We have
illustrated these constraints in the field of efficient learning from privacy-compliant data, from various
standpoints that include differential privacy as well as algebraic, geometric and computational
considerations. In that last case, results rely on NP-Hardness, and thus go beyond the "hardness"
of factoring integers on which some popular cryptographic techniques rely [1]. Finally, rados are
cryptography-compliant: homomorphic encryption schemes can be used to compute rados in the
encrypted domain from encrypted edge vectors or examples, so rado computation can be easily
distributed in secure multiparty computation applications.


10 Appendix: Proofs

To simplify the proofs, we define the following quantity:

$$\tilde{\pi}_\sigma = \sum_i \sigma_i x_i , \forall \sigma \in \Sigma_m , \tag{24}$$

so that each rado can be defined as: $\pi_\sigma = (1/2) \cdot (\tilde{\pi}_\sigma + \tilde{\pi}_y)$. We recall that $y$ is the label vector.
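As a sanity check of this definition, and of the identity between the logistic loss and the exponential rado-loss that Subsection 10.1 establishes, the following sketch enumerates all $2^m$ rados of a tiny random sample. The data, the classifier and the function names are hypothetical; the sketch assumes the exponential rado-loss is the average of $\exp(-\theta^\top \pi_\sigma)$ over the rados considered.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
m, d = 4, 3
X = rng.normal(size=(m, d))        # hypothetical observations x_i (rows)
y = rng.choice([-1, 1], size=m)    # label vector
theta = rng.normal(size=d)         # an arbitrary linear classifier

def rado(sigma):
    """pi_sigma = (1/2) * sum_i (sigma_i + y_i) x_i."""
    return 0.5 * ((sigma + y) @ X)

def rado_from_tilde(sigma):
    """Same rado written as (1/2) * (tilde_pi_sigma + tilde_pi_y), cf. eq. (24)."""
    return 0.5 * (sigma @ X + y @ X)

sigmas = [np.array(s) for s in itertools.product([-1, 1], repeat=m)]

# Logistic loss computed over the examples.
F_log = np.mean(np.log1p(np.exp(-y * (X @ theta))))
# Exponential rado-loss computed over all 2^m rados.
F_exp = np.mean([np.exp(-theta @ rado(s)) for s in sigmas])
identity_rhs = np.log(2) + np.log(F_exp) / m
```

On this toy sample, `F_log` and `identity_rhs` agree up to floating-point error, and both ways of writing a rado give the same vector.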

10.1 Proof of Lemma 

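We have

\begin{align*}
F_{\log}(\mathcal{S},\theta) &= \frac{1}{m} \sum_i \log\left(1 + \exp\left(-y_i \theta^\top x_i\right)\right) \\
&= \frac{1}{m} \sum_i \log \sum_{y \in \{-1,1\}} \exp\left(\frac{1}{2} \cdot y \theta^\top x_i\right) - \frac{1}{m} \cdot \frac{1}{2} \cdot \theta^\top \sum_i y_i x_i \\
&= \frac{1}{m} \log \sum_{\sigma \in \Sigma_m} \exp\left(\frac{1}{2} \cdot \theta^\top \tilde{\pi}_\sigma\right) + \frac{1}{m} \cdot \log \exp\left(-\frac{1}{2} \cdot \theta^\top \tilde{\pi}_y\right) \tag{25} \\
&= \frac{1}{m} \log \sum_{\sigma \in \Sigma_m} \exp\left(\frac{1}{2} \cdot \theta^\top (\tilde{\pi}_\sigma - \tilde{\pi}_y)\right) \\
&= \frac{1}{m} \log \sum_{\sigma \in \Sigma_m} \exp\left(-\frac{1}{2} \cdot \theta^\top (\tilde{\pi}_\sigma + \tilde{\pi}_y)\right) \tag{26} \\
&= \log(2) + \frac{1}{m} \log \frac{1}{2^m} \sum_{\sigma \in \Sigma_m} \exp\left(-\frac{1}{2} \cdot \theta^\top (\tilde{\pi}_\sigma + \tilde{\pi}_y)\right) \\
&= \log(2) + \frac{1}{m} \log \frac{1}{2^m} \sum_{\sigma \in \Sigma_m} \exp\left(-\theta^\top \pi_\sigma\right) \\
&= \log(2) + \frac{1}{m} \log F^r_{\exp}(\mathcal{S}, \theta, \Sigma_m) .
\end{align*}

We refer to ([21]) (Lemma 1) for the proof of (25). Eq. (26) holds because $\Sigma_m$ is closed by negation.

10.2 Proof of Theorem 3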

Let us suppose that our set of rados $\mathcal{U}$ satisfies:

$$\mathcal{U} \subseteq \Sigma_r \subseteq \Sigma_m , \tag{27}$$

where $\Sigma_r$ is a fixed reference subset of $\Sigma_m$. We shall use the shorthand $\mathbb{E}_U[f(U)]$ to denote the expectation of $f$ under uniform
i.i.d. sampling of $U$ in $\Sigma_r$. Furthermore, we also let for short

$$\varrho = \sup_{\theta \in \Theta} \max_{\sigma \in \Sigma_r} \exp(-\theta^\top \pi_\sigma) . \tag{28}$$

The proof relies on basic knowledge of VC theory and the "symmetrization trick", which can be
found e.g. in ([6]). Plugging eq. (28) into the proof of the symmetrization Lemma (Lemma 2 in
([6])) yields the following symmetrization Lemma for the exponential rado-loss. Notice that the
assumption is the same as in Lemma 2 in ([6]).

Lemma 13 For any fixed sample $\mathcal{S}$, for any $t$ such that $nt^2 \geq 2$, the following holds over the
Rademacher sampling of $\sigma$ in $\Sigma_m$:


sup(Eu [Fexp(S, 0 , U)] - F;,p(S, 0, IX)) > t 
0e© 


< 2f -F 


sup(Flxp(S,0,h) -FX,p(§,0,U')) > - 


exp V 


Lflee ‘ ‘ 2_ 

where 'll,IX' are two size-n i.i.d. samples. 

Consider U, IX' C E,-, each of cardinal n and differing from one assignment only. Then it follows, 


for any 0 ^ Q and from ineq. (29): 


2/ 

F” (S,0,U)-F” (S,0,'1X')| < 


n 


(29) 


Applying the independent bounded differences inequality ([20]), we get, for any 0 G Q and t > 0: 




< exp — 


nt 

loF 


(30) 


Letting $\Pi(n)$ denote the growth function for linear separators computed over rados, we still have
the upperbound

$$\Pi(n) \leq \left(\frac{en}{d+1}\right)^{d+1} . \tag{31}$$
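For a feel of the magnitude of this bound, the sketch below compares the right-hand side of ineq. (31) with the classical Sauer-Shelah sum of binomial coefficients that it upper bounds. It assumes the VC dimension $d+1$ implied by the exponent in (31); the function names and the values of $n$ and $d$ are illustrative.

```python
from math import comb, e

def sauer_sum(n, vc_dim):
    """Sauer-Shelah bound on the growth function: sum_{i <= vc_dim} C(n, i)."""
    return sum(comb(n, i) for i in range(vc_dim + 1))

def growth_bound(n, d):
    """Right-hand side of ineq. (31) for linear separators in dimension d."""
    return (e * n / (d + 1)) ** (d + 1)

n, d = 20, 3
exact_sum = sauer_sum(n, d + 1)
bound = growth_bound(n, d)
```

The closed form is looser than the binomial sum but far smaller than $2^n$, which is what makes the union bound over the growth function useful in the argument that follows.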
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We thus get, for any 0 ^ Q: 


sup(Et; [f;,p(s,0,c/)] - Fi;p(s,0,u)) > t 
0e0 


< 2f 


sup(Fj,p(S,0,U) -Fj,p(S,0,U')) > - 


6»e0 
2 


exp\ 


< 2n(2n)r • P 


< 4U{2n)f ■ P 


Fj,p(S,0,U)-Fj,p(S,0,U')>- 


Eu [f;,p(§,0,c/)]-f;,p(s,0,ii)> 


< 4n(2nK2.exp(^-^^ 


< 4 


2en 

d + 1 


d+1 


• exp - 


TeFy 


(32) 

(33) 

(34) 

(35) 

(36) 


Ineq. (32) follows from Lemma 13, ineq. (33) follows from standard VC arguments (see e.g. (0), 
Section 4), ineq. (34) follows from the observation that event a—b > u implies (a—c > u/2)V(6—c > 
u/2), ineq. (35) follows from (30), and finally ineq (36) follows from ineq. (31). Picking 

t = t* = 16£ ■ 


/1, . d, 2en 1 1 

- log t + - log —— + - log - 
n n a n r\ 


(37) 


yields that the right hand-side of ineq. ( |36[ ) is not more than r|, for any r) > 0. So with probability 
> 1 —r|, any classiher 0 G 0 will enjoy E[7[P)r^p(S, 0, [/)] < Fgxp(S, 0, 'Ll) and so we shall have: 

7Fog(S,0,7[) 

= log(2) + -.logF;,p(S,0,l[) 


m 

1 


> log(2) + - • log (E,; [Fj,p(S, 0, U)] - u) 


m 


= log(2) + -.log (E,; [Fj,p(S,0,C/)]) 


m 


1 


1 


d. 2en 1 


1 


H— - log 1 - 16£» • W - log ^ + - log —- + - log - 
m \ Vn n a n r\ j 


(38) 


= log(2) + 1 . logF;,p(S, 0, + log 


1 


1 


d, 2en 1 


1 


H— - log 1 - 16£» • W - log ^ + - log —— + - log - 


m 


— Fog (S) ^) 4-• log 


m 


n n 

F4p(S,0,S„ 


d 


n 




1 


1 


d. 2en 1 


1 


H— - log 1 - 16£» • W - log 7 + - log —— + - log - 


m 


n 


n 


d 


n 




(39) 


In eq. (38), we use the fact that g = l/E\j [pJxp(S, 0, U)] and Ejj [Fgxp(S, 0, U)] = 0, S^). 

Hence, reordering the expression yields that with probability > 1 — r|, the final classiher 0 will 
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satisfy: 


F\og{§>,d) < 


Kg{^,0,U) 


F;,p(g,0,S,) 
m FJxp(S,0,Sm) 


-log 



log i + d log 


2en 



(40) 


There remains to use the fact that i < exp(r6)maxs^ ||7to-||2) to complete the proof of ineq. ([^ in 
Theorem To prove ineq. Q , let us call 1 — z the quantity inside the log in ineq. pO] ) . We 
clearly have to have 0 < ^ < 1, and so for any value of 2 and for any 0 < a < 1, there exists a 
value m* > 0 such that 


- log- 

2 1 — z 


(> 0 ) 


(41) 


for any m > m*. In this case, we get after reordering, since 1 — z' < expz', 

z , ( z 


1 - 


mz 


< exp 


mr 

1 


< exp — log(l — z' 
m 


(42) 


and so, taking logs and using ineq. (39), we obtain that for any 0 < /3 < 1/2, there exists m* > 0 
such that for any m > m*: 

1 , FJ,p(S,0,S,) 


i"iog(S,0) < Fi;g(S,0,lt)---log 


m 




re 

• max 

1 

Tta- 

+ 

J n 

T,r 

m 


2 nm 


2en 1 
d nm 


Calling \ — z' the quantity inside the log, there remains to use log(l — z'') > —Kz' for some K > Q 
when z! is sufficiently close to 0 (hence, m sufficiently large again). This proves ineq. (|^ and 
completes the proof of Theorem Remark that provided n is sufficiently large, the right hand-side 
of ineq (41) admits the following equivalent: 


1, 1 

-log-- 

2 ; \ — z 




(44) 


with 2 ; = n(l/\/n) (omitting the dependences in the other parameters). Hence, ineq (41) can be 
ensured as long as m is large enough with respect to maxs^ ||(l/m) • 7tcr||2 (which cannot exceed 
the maximum norm of an observation in S), d and log(l/T|). 

So, when we apply this last result to RadoBoost, it says that for a large enough sample, 
we can indeed pick an n sufficiently large but small compared to m so that we shall observe with 
high probability a decay rate of the expected logistic loss computed over §, E[Fiog (S, Ot)]-, of order 
H(y^/m) (expectation is measured with respect to the sampling of R). 
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We 


are now left with 

-Q = - 


proving ineq. and so we study: 
j exp(-6>^7t^/) \ 

V ^ exp(-0T7t,) j 

, ( l^ml E^'es, exp(-0T7t^/) \ 

\ |Sr| E^es^exp(-0T7r^) j 


1 

m 

1 . loff E,.'es.exp(-^^7t^') 

^ V E^'es, E^es^ exp(-6»T7t^) 

Ecr'PV„ l^rr^y.^ c^yy-u ,v„ 


1 

m 

1 

m 


• log 


'gs, Eo-es^ exp(-0~^7t^) • exp(-0~r(7t^/ - 7t^))' 


'-f ^ 

X^cr'gSr 5^0 


-D 

7/1 \ ■ ' 

with D{a,a') oc exp(—0^7tcr). Jensen 


^/TGS™ exp(-6»T7t„ 
exp(-0'^(7to./ - Tto-)) 


• log (lE(o.,o./), 

’s inequality yields: 

Q — ■ 

m ^ / 

1 


We now remark that 


m ^ / 


e^Tta 


771 ^ ’ 7 


0^7to 




E/t'gs. E,tge^ exp(-0T7t^) . 0'7t, 

E, ^ 


and furthermore 

1 


= 

= 0^E, 


_ m 

'gS,. Eo-gSm exp( —0^7tcr) 


>er'gSr• (EcrgSrn 

Eo-'gSr EcrgS^ exp(-0T7x;^) 


'cr~Er [^cr] 


®^(cr,crM~D 
m V » y 




1 _ E/T'gs, E/Tgs^ exp(-6>^7t^) • 6> 
E<T'gE. E^gE„ exp(-6>T7t, ' 
^ _ Eo-gE^ exp(-0^7t^) • 0^7t^ 
rn E,TgE^ exp(-0T7t^) 

_ qT f _ EergE^ exp(-0^7to-) • Tto- 
E,TgE^ exp(-6»T7t„ 


|T 


Ttc 


m 

|T^ 1 


e^Ve 

qT 


9--logF;,p(§,0,S^) 
m ^ 

= e^VeF{,^{§,e,Em) . 

Assembling eqs (46) and (47), we get from ineq. (45): 


Q < re VeFjQg (S, 0, S^) - IEcr~E^ 


' 1 

— -Tto- 

m 


< 


I claimed. 


re(^\\V0F{,^ (S,0,S. 


‘m) lb + 


lEcr~E,. 

■ 1 

— ■Tta- 



m 



(45) 


(46) 


(47) 
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10.3 Proof of Lemma m 

Theorem 1 in ([22]) immediately yields 


^ T 

- exp (-0? < Yl • ^(T+i)i , Vj G [n] . 

t=i 


Since l^tur+i = 1; summing over j G [n] yields: 


(48) 


F;,p(S,0t,1I) < nv^ 

i=l 
< exp 


- rt 


E- 


Using the (WLA), this yields ineq. (12). 

10.4 Proof of Lemma [5] 

Fix for short k = t(t). We rewrite rt{wt) as a function of the examples: 


rt{wt) = - 

Tr-*k 

1 

1 


1 ” 

; Wtj'^jk 


i=l 




jVi^ik 


j=l i:(Tji=yi 
m 


Define w E such that 


with 


1=1 


1 X 


EU^' E 




/ I Vi^ik 


j-cTji—yi 


Wi = — ■ —^ Wtj ,Vz G 

J3 1 —y I 


^ 


m 


'^*k 


i=l j-.(Jji=yi 
n 


J2wtj\{i ■■ CTji = yi}\ 


i=i 


(49) 


(50) 


(51) 


the normalization coefficient. Because Wt G x^k > 0 and Tt*^ > 0, it comes that indeed w G 
and IF > 0 (unless is reduced to the null rado). We thus have \rt{wt)\ > y iff 


\rT{^)\ > 


X 

IF 


This proves the statement of the Lemma. Remark that 


X^k 


< IF < 


Xi^k 


'^*k 


maxj \{i.crji=yi}\ 


(52) 


(53) 
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so if we assume the weak learning assumption holds for the examples, \rf^{w)\ > > 0, then the 

weak learning assumption over rados always holds for 


y = 


_ ^ex 


and may holds for a value y which can be as large as 


y = 




maxj \{i:aji=yi}\ 


■r 


(54) 


(55) 


These two bounds are data dependent (but they depend on data only), and whenever there are
significant outlier values for feature $k$, i.e. $x_{*k}$ is achieved by few examples and all others have
feature value of significantly smaller order, then the available $\gamma$ can be significantly larger than $\gamma^{ex}$.
Compared to the cases where no such outliers would exist, we thus may expect significantly better
results for RadoBoost.


10.5 Proof of Theorem 1^ 

To ease notations hereafter, we consider wlog that d = 1 and so j* = 1- We also drop index notation 
in related notations (so becomes Sm)- 

We let S and S' denote two j-neighbors, so that S S' holds and they differ by the value of 
one (boolean) feature. Algorithm DP-Feat selects uniformly at random the rados in sets 

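$$\Sigma_m(\mathcal{S}) = \{\sigma \in \Sigma_m : \pi_\sigma \in I(\mathcal{S})\} ; \tag{56}$$

$$\Sigma_m(\mathcal{S}') = \{\sigma \in \Sigma_m : \pi_\sigma \in I(\mathcal{S}')\} , \tag{57}$$

with

$$I(\mathcal{S}) = \{z : -(m - m(+)) + \beta(m+1) \leq z \leq m(+) - \beta(m+1)\} , \tag{58}$$

$$I(\mathcal{S}') = \{z : -(m - m(+)) + \beta(m+1) + \zeta \leq z \leq m(+) - \beta(m+1) + \zeta\} , \tag{59}$$

since $m'(+) = m(+) + \zeta$ for some $\zeta \in \{-1,0,1\}$. To relate the sizes of these two sets, we first
compute the size of $\{\sigma : \pi_\sigma = r | \mathcal{S}\}$, for $r \in \mathbb{Z}$. Assuming first $r \geq 0$, we have:

$$|\{\sigma : \pi_\sigma = r | \mathcal{S}\}| = \sum_{k \geq 0} \binom{m(+)}{r+k} \binom{m - m(+)}{k} . \tag{60}$$

If $r < 0$, then similarly:

$$|\{\sigma : \pi_\sigma = r | \mathcal{S}\}| = \sum_{k \geq 0} \binom{m - m(+)}{-r+k} \binom{m(+)}{k} , \tag{61}$$

which is the same expression as (60) with the substitutions $r \mapsto -r$, $m(+) \mapsto m - m(+)$, $m - m(+) \mapsto m(+)$,
so we have only to analyse the case $r \geq 0$. If $m(+) - r \geq m - m(+)$, we have by Vandermonde
identity:

$$|\{\sigma : \pi_\sigma = r | \mathcal{S}\}| = \binom{m}{m(+) - r} . \tag{62}$$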
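The closed form $\binom{m}{m(+) - r}$ for the number of rados of value $r$ can be checked by brute force on a small one-dimensional sample. The sketch below assumes, as in this proof, that $d = 1$ and that each edge value $y_i x_i$ is $\pm 1$, with $m(+)$ the number of positive edges; the sample and the function name are hypothetical.

```python
import itertools
from math import comb

def count_rados(edges, r):
    """Brute-force count of the sigma whose rado value equals r (d = 1)."""
    # With d = 1, a rado is the sum of the edge values y_i * x_i it selects,
    # i.e. those with sigma_i = y_i; mask[i] = 1 encodes sigma_i = y_i.
    m = len(edges)
    return sum(
        1
        for mask in itertools.product([0, 1], repeat=m)
        if sum(e for e, keep in zip(edges, mask) if keep) == r
    )

edges = [1, 1, 1, -1, -1]     # hypothetical sample with m = 5 and m(+) = 3
m, m_plus = len(edges), edges.count(1)
counts = {r: count_rados(edges, r) for r in range(-(m - m_plus), m_plus + 1)}
closed_form = {r: comb(m, m_plus - r) for r in counts}
```

On this sample the brute-force counts and the binomial coefficients coincide for every admissible $r$, including negative values.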
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If m(+) — r < m — m{+), then it is not hard to show that Vandermonde identity still brings (62). 
We thus have 


|S^(S)I = 


> 


> 


m(+)-/3(m+l) 

E 

r=-(m-m(+))+/3(m+l) 


m 


l3{m + 1) 


+ 


m 

m(+) — r 
m(+)—/3(m+l) 

E 


r=—(m—m(+))+/3(m+l)+l 


m 

m{+) — r 


m — j3{m + 1) + 1 


I3{m + 1) 


m 


+ 


m(+)-l3{m+l) 

E 


-1 • 


m 


m 


+ 


+ 


m 

m{+) — r 


m 

m{+) — r 


Bim + 1) — 1. 

r=-(m-m(+))+/3(m+l)+l 
m{+)-l3{m+l) 

B(m + 1) — 1 / ' 

r=-(m-m(+))+/3(m+l)+l 
m(+)-/3(m+l) 

B(m + 1) — 1 / ' 

r=—(m—m{+))+j3{m+l)+l 
m(+)-,9(m+l)+l 

E 

r=-(m-m(+))+/3(m+l)+l 

m(+)-/3(m+l)+l / I \ I 1 

_ ^2 m-(+j + 1 - r 

r=—(m—m(+))+/3(m+l)+l 
m(+)-/3(m+l)+l 

m — j3(m + l) + l V(m(+) + l) — r 

r=—(m—m(+))+/3(m+l)+l 


m 

m{+) — r 


m 

m(+) — r 


m 


m — m(+) + r \(m(+) + 1) — r 
/5(m + 1) 


m 


= ^ -1 


= 


-1 m(+)-/3(m+l)+l 

E 

r=-(m-m(+))+/3(m+l)+l 

•|S^(S')I 


m 

(m(+) + 1) - r 


(63) 

(64) 

(65) 

( 66 ) 


23 






if C = 1, and 


|S^(S)I = 


m 

m{+) — r 


m 

/3{m + 1) 


m —/3(m + l) + l / m 

+ 1 ) - 1 


m 

m{+) — r 


> 


m(+)-/3(m+l) 

E 

r=-(m-m(+))+/3(m+l) 

m(+)-/3(m+l)-l 

+ E 

r=-(m-m(+))+/3(m+l) 

m(+)—/3(m+l) —1 

+ E 

r=-(m-m(+))+/3(m+l) 
m(+)-/3(m+l)-l 

+ E 

r=-(m-m{+))+p{m+l) 
m(+)-/3(m+l)-l 

E 

r=—(m—m(+))+/3(m+l) 


/3(m + 1) 


m 

m(+) 


— r 


1 


m 

/3{m + 1) — 1 


m 

m(+) — r 


m 

/3{m + 1) — 1 


m 

m(+) — r 


> 


m(+)-/3(m+l)-l 

E 

r=—(m—m(H-))+/3(m+l) —1 

i-l) '-ISiH')} 


m 

m{+) — r 


if Q = —1. The last inequality follows from the same chain of inequalities as in eqs. (63 
now bound the ratio of probabilities for the rado being equal to r, for both sets: 


(67) 
-|66D. We 


_ |S^(§')| L(+)-r) 


= ^|S'] 


< 


S^(S)| Li+T+C-r) 

) ( m \ 
\m{+)-r) 

' LEET) 




( 68 ) 




< U-l 


(m(+) + C “ r)\{m — m{+) — C + ?')! 

(m(+) — r)\{m — m(+) + r)! 

if c = 1 

m—m(+)+r ^ 

1 if C = 0 

m-m{+)+l+r ^ ^ 

m(+)—r ^ 

(69) 

The last inequality comes from eq. ( |58[ ) which guarantees r > — (m — m(+)) + /3(m + 1), and so 

(70) 


13 


m(+) + 1 - r ^ 1 

- 


m — m{+) + r 


and furthermore eq. (58) also guarantees r < m{+) — (5{m + 1), and so 


m — m{+) + 1 + r 1 

-EELJ-— <-1 

m(+) — r (3 


(71) 
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as well. We finally get from ineq. (69): 


(S) 


£cs'( ~ ''’1^] 


g,(S') “ ^1^'] 


< exp(e) , 


(72) 


cr~S 


which holds for any $r \in \Sigma_m(\mathcal{S}) \cap \Sigma_m(\mathcal{S}')$. Notice however that the symmetric difference of these
two sets is not empty. To finish the proof, we need to take into account this symmetric difference.
This is the data-dependent step in DP-Feat which may leak information about one feature and
disclose its content, through the use of eq. (58). To see this, if we assume that one possesses all
the data but the unknown feature value for one person, and knows how rados are computed using
DP-Feat, then by observing the output $\pi_{\sigma j_*}$, one may guess the unknown value, as depicted by
Figure 3. Let us denote $A$ this event. When returning one rado, if we consider without
loss of generality a uniform distribution over examples, then, referring to the notations of Figure 3,
we have:


P[y4] = P[^|S]P[S]-fP[A|§']P[S'] 
< P[7l|S] + P[7l|S'] . 


(73) 

(74) 


If A occurs in S, then it is for r = m(-t-) — {m — I3{m + 1)) in Figure]^ We get from eq. (62): 

(m—/3(m+l)) 


F[A\§] = 


E m- 
T= 


m—P{m+l) fm 


o 




E m-/ 3 (m+l) i'm\ ' 
r’=/3(m+l) \rJ 

and we obtain following the same reasoning, using the fact that r?T,(-|-) increases by one in S', 


(75) 


P[7l|S'] = 




m—/S(m+l) fm 


E fft- 
r -— 


r=/3(m+l) 


o 


(76) 


The probability of hitting the symmetric difference of $\Sigma_m(\mathcal{S})$ and $\Sigma_m(\mathcal{S}')$ is taken into account by
considering $\delta = \mathbb{P}[A]$ in the $(\epsilon, \delta)$-differentially private release of one rado. We get:


6 < 




E m-/3(m+l) /Tn\ 
r=/3(m+l) \r) 

The interplay between e and 5 can be appreciated throughout the use of the following properties: 


(77) 


we have used 



H{z) 

u 


-zlog2Z- 


13 


1-/3 

m 


(1 


2;)log2(i - z) , 


(78) 

(79) 
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We get 

2 1 

Because H{u) is concave, it satisfies (fixing e' = e/2 for short): 


( 80 ) 


H{u) < H{/3) + {u-p)H'{/3) 

= H{I3) - —^og2 —^ 


= H{P)- 


m 

(l-/3)e' 


m 


j^.(l 06 (l+exp(£'))-(l + i)- 


e' exp e' 

1 + exp e' 


We have: 


1 1 


1 _ 2"i-(/(e')-i) 2m2log2(2)e' \2 Am?\og^{2) 


= fie') 


+ 0 (e') 


(81) 


(82) 


So, assuming e' = o(l), there exists m' > 0 and a constant K > 0 such that for any m > m', 

8 < K- ^ . (83) 

m 2 e 

Finally, we get that when e = n(l/m), (e, 6 )-differential privacy can be ensured on the delivery 
of n = 1 rado as long as e • 6 = 0(m“^/^). Taking into account the fact that rados are generated 
independently and using Theorem 3.16 in [12] concludes the proof of Theorem]^ for arbitrary n. 

To finish the proof, we remark that Sm(.) / 0. Indeed, since m > 1, (3 < m/{m + 1); 
furthermore, as long as m > 2 , provided we also have 


1 + 2/3 
1 - 2 ^ 


0(m) , 


we shall have I(S) H Z 7 ^ 0. This can easily be ensured if 


1 

- + e 
e 


0(m) , 


i.e., provided e = o(l), e = Q{l/m). 


(84) 


Figure 3: Knowing everything (including DP-Feat) but the actual feature value for a particular
individual (in black), one can hack this unknown if he/she is returned by DP-Feat a rado whose
value $\pi_{\sigma j_*}$ falls on one of the two red dots: if it is the left one, the value is $-1$, and if it is the right
one, the value is $+1$. The probability of hitting one of the red dots for one rado is $\mathbb{P}[A]$ in eq. (73).


10.6 Proof of Theorem 0

We keep the same notations as in the proof in Subsection 10.5. The Rademacher rejection sampling of
$\sigma$ has a probability to reject a single rado bounded by (a fraction of) the tail of the Binomial, as
indeed


< 


< 


< 


T ( ”* ) 

^ V"i(+) — r) 

r<-(m-mj,(+))+/3(m+l)Vr>mfc(+)-/3(m+l) 



E 


m + 1 — r 


2™ ^ m -Fl 

r=(l-/3)(m+l) 


2/3- — 
2 ”^ 


E 

r=(l-/3)(m+l) 
.. m+1 

I?- — - T 

^ 2'm / 

r={l-p){m+l) 


m + 1 
r 

m+1 
r 


("T) 



4/3exp (-(m + 1) • Ds£;(! 47-/3||l/2)) , 



(85) 















where $D_{BE}$ is the bit-entropy divergence ([3]):

$$D_{BE}(p \| q) = p \log \frac{p}{q} + (1-p) \log \frac{1-p}{1-q} . \tag{86}$$

The last equation follows e.g. from Theorem 2 in ([!]). So the probability p that there exists a 
rado, among the n generated, that was rejected at least times for some > 1 satisfies 

OO 

p < 4n/3 ^ exp(-(m-hl)-/3||l/2)) 

t=Tr 

OO 

= 4nj3 ■ exp(-(m + 1) ■ Tr ■ - /3||l/2)) • ^exp (-(m + 1) • t • Dbe{4 - /3||1/2|^7) 

t=o 

We now use the facts that (i) m > (1 -|- 2/3)/(l — 2/3) (Step 2 in Algorithm DP-Feat), and (ii) 
function 

f{z) = • (log( 2)(1 - z)log(l - z)-hzlogz) ( 88 ) 

is convex over [0, 1 / 2 ) and has limit tangent 1 — 2z in z = 1 / 2 , so 

exp (-(m -h 1 ) • DBEil - /3|| 1 / 2 )) < exp ' 0 og( 2 ) + (1 “ /3) log(l - l3) + P log /3)^ 

< exp( 2 ^ - 1 ) (< 1 ) , 


and it comes 


^exp(-(m-F 1) • t • Dbe(1 - (3\\l/2)) 


< 


and so 


t=o 


P < 


1 — exp(2/3 — 1 ) ’ 


4n/3 


1 — exp(2/3 — 1 ) 


So, if n,/3,r\ are such that 


n < 


exp (-(m + 1) - Tr - Dbe (1 - (3\\l/2)) 


Ti(l-exp(2^-l)) 

4/3 


(89) 


(90) 


(91) 


then there is probability $\geq 1 - \eta$ that no rado was rejected. Otherwise, with probability $\geq 1 - \eta$,
each rado among the $n$ was rejected no more than

$$T_r^* = \frac{1}{m D_{BE}(1 - \beta \| 1/2)} \log \frac{4 \beta n}{\eta (1 - \exp(2\beta - 1))} \tag{92}$$

times. There remains to multiply this bound by the number of rados to get an upperbound on the
number of iterations of Rademacher rejection sampling, and we obtain the bound of the Theorem. This finishes the
proof.

Figure 4: Left: function $f(\beta)$ as depicted in eq. (93). Right: same function over a smaller range,
depicting the value of $f$ for $\epsilon = 0.1$ (thick dark line) and $\epsilon = 0.01$ (slim dark line).


Remarks: the actual dependence of eq. (92) on $\beta$ is such that unless $\epsilon$ is extremely close to 0, in
which case the requirement on differential privacy is the strongest, $T_r^*$ does not actually blow up.
To see this, let us define

$$f(\beta) = \frac{1}{D_{BE}(1 - \beta \| 1/2)} \log \frac{4\beta}{1 - \exp(2\beta - 1)} . \tag{93}$$


Figure 1^ displays /(/3) over different ranges. One sees that when e = 0.1, provided m/logn is in 
the order of thousands and n ^ e, then T* is in fact of the order log(l/ri), which may be quite 
small indeed. 
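The quantities involved in this remark are easy to evaluate numerically. The sketch below computes $\beta$ from $\epsilon$, the bit-entropy divergence of eq. (86), and a function `f` that follows one plausible reading of eq. (93), namely $\log(4\beta / (1 - \exp(2\beta - 1)))$ divided by $D_{BE}(1 - \beta \| 1/2)$; that reading, the function names and the use of natural logarithms are assumptions of this example.

```python
import math

def beta_from_epsilon(eps):
    """beta = 1 / (1 + exp(eps / 2)), as in Step 1 of Algorithm DP-Feat."""
    return 1.0 / (1.0 + math.exp(eps / 2.0))

def d_be(p, q):
    """Bit-entropy divergence of eq. (86), natural logarithm, 0 < p, q < 1."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def f(beta):
    """Assumed reading of eq. (93); not confirmed by the source."""
    numerator = math.log(4.0 * beta / (1.0 - math.exp(2.0 * beta - 1.0)))
    return numerator / d_be(1.0 - beta, 0.5)

# The two privacy levels highlighted in Figure 4.
values = {eps: f(beta_from_epsilon(eps)) for eps in (0.1, 0.01)}
```

Under this reading, `f` grows as $\epsilon$ shrinks, because $\beta$ approaches 1/2 and the divergence in the denominator vanishes.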


10.7 Proof of Theorem Is] 

Let us first remark that the DP-protection of edge vectors by computing the noisified example set

$$\tilde{\mathcal{S}} = \{(\tilde{x}_i, y_i) = (x_i + x_i^r, y_i), i \in [m]\} , \tag{94}$$

where $x_i^r \sim N(0, \varsigma^2 I)$, is equivalent to noisifying edges because label $y \in \{-1,1\}$ and the pdf of the
Gaussian mechanism is invariant by multiplication by $y$.


The key quantity to prove the Theorem is, for any noisified rado $\tilde{\pi}_j = (1/2) \cdot \sum_i (\sigma_{ji} + y_i) \tilde{x}_i$,
the support $m_j = |\{i : \sigma_{ji} = y_i\}|$ of the rado. We also renormalize the leveraging coefficient in
RadoBoost, replacing its update in RadoBoost's pseudocode by:

$$\alpha_t = \frac{1}{2 \kappa \pi_{*\iota(t)}} \log \frac{1 + r_t}{1 - r_t} , \tag{95}$$

for some fixed $\kappa \geq 1$.


^Recall that $\beta = 1/(1 + \exp(\epsilon/2))$ in Step 1 of Algorithm DP-Feat.

We now embark on the proof of the Theorem. Lemma 2 in ([22]) yields
exp = exp i-OT'Jtj ) • exp { ^ • 0^ ^ {uji + yi)xl | 


- • nw^T+i)j^ ■ exp X] ^ N -(96) 


Averaging over j £ [n] yields: 


^exp(S, ^T, U) < ^ ^ " ^^(r+i)j • exp f ^ X] 


^t=i 


i=i 

n 


- ^ ^ ^ '^iT+i)j • exp ^ {aji + yi)xl 


(97) 


t 


B 


with = n^~^w^rp_^-^^^y The right-hand side of ineq. (97) multiplies two separate quantities, 

A which quantifies the performances of Ot in RadoBoost on the set of noisy rados on which it 
was trained, and B which is an expectation, computed over wt, of the agreements between 6t and 
the noisy part of the rados. When rados are noise-free and k > 1, we have x\ = 0, Vi and 


i=i 


1 ^ 

= n« • — w 


n ^—<■ 

j=i 


< n« 


n 




j=i 


i _i 

= ' n fc = 1 


(98) 


because of the concavity of and so we return to the noise-free rado boosting bound with 

“penalty 1 /k” for renormalizing the leveraging coefficients in RadoBoost (this proves ineq. (d!])). 
Assuming Qt output by RadoBoost, we obtain, VS, it such that support of all n rados is of the 
same size, i.e. rrij = m*,Vj G [n], 


Fi;g(S,0T,U) 


1 


= log(2) + -logF4p(S,0T,it) 


m 

1 


1 


- ^ ^ ^ m ■ ^ • """P 2 ■ ^ 


n 


- ^ E ^ ■ >»8 E ■ «p 

t j=i 

2Km 


= ‘”15(2) - ss; E ^ ■ ‘”S E At+ 1 ), ■ exp 


— • ((Tji + yi)xl 


/m* - X . — * 


( 99 ) 


=c 


=D 
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We now study a sufficient condition for C — D io he VL{{l/m) with high probability over 

the noise mechanism, thereby ensuring a convergence rate over non-noisy rados that shall comply 
with the noise-free bounds of ineq. (13), up to the hidden factors. This shall be achieved through 
several Lemmata. 

Lemma 14 With probability > 1 — t over the noise mechanism we shall have: 

<^ji + Vi 


E 





/ /'n\ 

< 

j21og(- 

2 

V Vt/ 


( 100 ) 


Proof The Sudakov-Tsirelson inequality (0, Theoiem 5.6) states that if x lNf(0, I^^) and fi^x'j . 
—>■ M is L-Lipschitz, then 


'[f{x)-E[f{x)]>t] < exp 


2L2 J 


( 101 ) 


Since function f{x) = ||a ;||2 is 1-Lipschitz by the triangle inequality and ^ standard 

Gaussian random because the x^ are sampled independently, ineq. (101) yields that we shall have 
simultaneously over the randomized part of the rados, with probability > 1 — t, 

T Vi 


E 


r®,- 



/ /n\ 

< 

j21og(- 

2 

V Vt/ 


which proves the Lemma. 


Lemma 15 Assume 6t G (B(0, r^) for some rg > 0. Then with probability > 1 — t over the noise 
mechanism we shall have 


D < 



( 102 ) 


Proof We use Lemma 14 


Cauchy-Schwartz inequality implies 



i 


+ Vi 


< 


< 


ll^rlb • 


(^ji + Vi r 
^ 2qy/m; * 



We thus get in this case 


(103) 


i=i 


< — J2m,log(- 

m V '"T 


(104) 


because of ineq. (98). 


We now prove a specihc rg > 0 which makes use of the concentration of the randomized part of 
rados in Lemma [TH 
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Lemma 16 Suppose there exists /x > /x' > 0 such that it simultaneously holds: 

mill/; maxj |7Tjfc| 


h < 


m* 




At' < — 


(105) 

(106) 


where = (1/2) + yi)^ik is the non-noisy part of rado . Assume the existence of p > 0 

such that the weak learner WFi in RadoBoost is Xp-prudential for 

(107) 


^ = 1 - n - / - 

Vl — pK/x'm* 


Then probability > 1 — t over the noise mechanism we shall have 

IWrh < (1 - P)'^r^ . 


(108) 


Remarks: notice that ineq. (105) is equivalent to saying that each coordinate $k$ has at least one
non-zero entry in the noise-free part of the rados. Unless coordinate $k$ is zero for all examples,
in which case we can just discard this feature, this assumption is easy to satisfy.

Proof We have 


W^rh-'^rf = ^ 


1 


t 


log' 


2 1 + U 2 
- r. 


^-n 

Assuming the existence of 2 ; > 0 such that > ZjVt, and using the fact that 


log - < 


1 — X (1 — |x|)2 


,Vx G (0,1) 


we shall have 


E 


t 


log 


2I + rt 2 


1-rt 


-rt < 


< Ei- 


I+ rt 


- r. 


n 




2;2 (l_|ri|)2 


- n 


< - 


E- 

t 

oY.' 


2;2(1 - |rt|)2 


t 5 


as long as 


\n\ < 1 - 


,Vt , 


(109) 


( 110 ) 


( 111 ) 


( 112 ) 


\/l - Pz 

where p G (0,1). Since > minfcmaxj |7t^|, we can fix 2 * = 2Kminfcmaxfc |7t^|, but recall that 


n 


sums a random Gaussian part and a non random part. Ineq. (100) tells us that with high 
probability, the magnitude of the random part will satisfy 


+ < ?y2m*log ,Vj G [n] . 


(113) 
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Thus, we shall have in this case, using ineqs. (105, 106) and given Lemma 14 


mm max n , 
k 


^ —log L 


m* 


n 


m* 


> /U m* 


and we get the statement of the Lemma. 


We now return to ineq. (99), and use Lemmata 14, 15 and 16, and obtain that with probability 
> 1 — T, a sufficiently prudential weak learner shall imply: 

Ff,g(S,0T,U) 

1 ^ 

S r, + — . log U.,T+1« ■ exp 


1=1 


/m* 




T Vi _ 
2?^/ruT' 


tX, 


^ '“s(2) - - ■ (>“s (x) ) E 


2k 


(114) 


=E 


We want E > 1/(4k:). Equivalently, we want 

1-p < 


4K<jY'^2m^k)g^^ 


(115) 


and for the prudential weak learner to exist, we also need 

4 

1-p > 


K2/i'2777,2 


(116) 


Assuming ineqs (105) and (106), we thus get that if 

4? 


K > 


/i'2 777,2 


21og(- 


(117) 


then there exists a prudential weak learner for which, with probability > 1 — t over the noise 
mechanism, we shall have after T rounds of boosting of RadoBoost, using the prudential weak 
learner and renormalizing the leveraging coefficients by k as in (95), 

1 


F,;,(S.0r,U) < log(2)-— 


(118) 


which proves Theorem Notice that the constraint k > 1 can easily be enforced by picking n' 
sufficiently small. 


Remarks: we finish by emphasizing the fact that ineq. (19) is computed over non-noisy rados. It
is not hard to see that ineqs (105) and (106) shall be all the easier to meet as $m_*$ is large compared
to $\log n$, $\log(1/\tau)$ and $\varsigma$. So, provided rados have a sufficiently large support, the convergence rate of
the logistic rado-risk of RadoBoost over the non-noisy rados may compete, up to a small constant
factor, with the one that would be achieved by training RadoBoost over non-noisy rados.

Figure 5: Construction for the proof of Lemma 10. Black dots denote edge vectors from $\mathcal{S}$; at least
one ball, in blue, contains no such edge vector.


10.8 Proof of Lemma 10

Consider first that $m \geq 2^d$. A simple proof of the Lemma consists in considering the largest d-dim
square, of edge length $\ell = 2R/\sqrt{d}$, shown with a thick dashed line in Figure 5. We then pack this
square with $m + 1$ spheres, as shown. Since the edge length is covered by $\lceil \log(m)/\log(d) \rceil$ diameters
of these spheres, we obtain that the radius $r$ of each such sphere satisfies:


2R 

vs-r'^i 

Rlogd 

2^/dlog{m + 1) ’ 


(119) 


because $m > 2^d > d$. By construction, at least one of these spheres does not contain an 
edge vector from $\mathcal{E}(\mathcal{S})$ and is thus empty. Consider one such empty sphere whose center $e^*$ is the 
closest to 0, as shown in Figure 5, and consider one adjacent sphere, located no farther, with one 
edge vector $e = yx$ from $\mathcal{E}(\mathcal{S})$ inside, with $(x, y) \in \mathcal{S}$, where $\mathcal{S}$ generates $\Pi$. (If no such sphere 
exists, we can pick $e^* = 0$, the center of a sphere $\mathcal{B}(0, r)$ which contains no example from $\mathcal{S}$. 
In this case, there is no need to remove any example from $\mathcal{S}$: the proof still holds by adding example 
$(0, y)$ to $\mathcal{S}$, to create $\mathcal{S}'$.) We create $\mathcal{S}'$ out of $\mathcal{S}$ 
by replacing $(x, y)$ by two examples, $(ye^*, y)$ and $(e - ye^*, y)$. It is worthwhile remarking that 

$$\mathcal{E}(\mathcal{S}') \subset \mathcal{B}(0, R) \qquad (120)$$

by construction, and furthermore any rado that can be created from $\mathcal{S}$ can also be created from 
$\mathcal{S}'$. Hence, any $\Pi$ defined over $\mathcal{S}$ can also be obtained from $\mathcal{S}'$. There remains to remark that, by 
construction, $e^*$ is at distance at least $r$ from every edge vector of $\mathcal{S}$, and so: 


$$D_H(\mathcal{S}, \mathcal{S}') = \Omega\left( \frac{R \log d}{\sqrt{d} \log m} \right) ; \qquad (121)$$


this proves Lemma 10 when $m > 2^d$. When $m < 2^d$, the construction of Figure 5 can still be done 
but with larger balls, for which 

$$r = \frac{R}{2\sqrt{d}} . \qquad (122)$$

Picking as $e^*$ the center of any of these empty balls, we obtain 

$$D_H(\mathcal{S}, \mathcal{S}') \geq \frac{R}{2\sqrt{d}} , \qquad (123)$$

as claimed. 
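
As a quick numerical sanity check of the two regimes of this proof, the sketch below evaluates the packing radius and compares it with the right-hand sides of (119) and (123). The function names and the sample values of R, d and m are illustrative assumptions, not taken from the paper; the radius is computed from the statement that the edge $2R/\sqrt{d}$ is covered by $\lceil \log(m)/\log(d) \rceil$ sphere diameters.

```python
import math


def packing_radius(R: float, d: int, m: int) -> float:
    """Sphere radius when the edge 2R/sqrt(d) is covered by
    ceil(log(m)/log(d)) sphere diameters (regime m > 2^d, d >= 2)."""
    edge = 2.0 * R / math.sqrt(d)
    diameters = math.ceil(math.log(m) / math.log(d))
    return edge / (2.0 * diameters)


def radius_lower_bound(R: float, d: int, m: int) -> float:
    """Right-hand side of (119): R log d / (2 sqrt(d) log(m + 1))."""
    return R * math.log(d) / (2.0 * math.sqrt(d) * math.log(m + 1))


def small_m_bound(R: float, d: int) -> float:
    """Right-hand side of (123), used when m < 2^d."""
    return R / (2.0 * math.sqrt(d))


# Hypothetical values, chosen only so that m > 2^d holds.
R = 1.0
for d, m in [(2, 10), (3, 100), (5, 10_000)]:
    r = packing_radius(R, d, m)
    bound = radius_lower_bound(R, d, m)
    assert m > 2 ** d and r >= bound
    print(f"d={d}, m={m}: r={r:.4f}, bound={bound:.4f}")
```

The check only confirms that the stated radius dominates the bound on a few sample points; it is not a substitute for the argument above.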

10.9 Proof of Lemma 11 

We make a reduction from the X3C3 ([25]) problem whose instance is a set $S = \{s_1, s_2, ..., s_n\}$, 
a set of 3-subsets of $S$, $C = \{c_1, c_2, ..., c_d\}$, and an integer $m$. Each element of $S$ belongs to exactly 
three subsets of $C$. The question is whether there exists a cover of $S$ using at most $m$ elements 
from $C$. The reduction is the following: 

• to each feature corresponds an element of $C$; 

• to each element $s_j$ of $S$ we associate a boolean rado $\pi_j$ which is 1 in coordinate $k$ iff $s_j \in c_k$, 
and zero otherwise: 

$$\pi_j = 1_{\{k : s_j \in c_k\}} \qquad (124)$$

($1_{\mathcal{I}}$ is “1” in coordinate $k$ for $k \in \mathcal{I}$, and zero everywhere else); 

• the number of examples is $m$; 

• parameters $r$ and $\ell$ are fixed as follows: 

- if $p \neq 0$, the value of $r$ is $2^{1/p}$. We also fix $\ell$ = ε-machine, where ε-machine is the smallest 
ε such that $1 - \varepsilon < 1$ in machine encoding; 

- else if $p = 0$, then $r = 2$ and $\ell = 1$. 
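
The reduction can be made concrete with a small sketch that builds the boolean rados of (124) from an X3C3 instance. The instance below and the names `build_rados` and `C` are illustrative assumptions; only the rule "coordinate k of the rado of element j is 1 iff element j belongs to subset k" comes from the text.

```python
from typing import List, Set


def build_rados(n: int, C: List[Set[int]]) -> List[List[int]]:
    """Boolean rado of element j: 1 in coordinate k iff j belongs to C[k]."""
    return [[1 if j in c_k else 0 for c_k in C] for j in range(n)]


# Hypothetical instance: 6 elements and 6 three-element subsets,
# each element belonging to exactly three subsets.
C = [{0, 1, 2}, {3, 4, 5}, {0, 1, 3}, {0, 2, 4}, {1, 3, 5}, {2, 4, 5}]
rados = build_rados(6, C)

for j, pi_j in enumerate(rados):
    # Every rado has exactly three non-zero coordinates.
    assert sum(pi_j) == 3
    print(j, pi_j)
```

Each feature indexes one subset, so the dimension of the rados equals the number of subsets in the instance.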

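Let us number the constraints of Sparse-Approximation, so that we want: 

$$\|x_i\|_p \leq \ell , \quad \forall i \in [m] , \quad \text{(Sparse examples)} \qquad (125)$$

$$\|\pi_j - \pi_{\sigma_j}\|_p \leq r , \quad \forall j \in [n] . \quad \text{(Rado approximation)} \qquad (126)$$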

Suppose there exists a solution to X3C3 with $m$ subsets of $C$, $C^* = \{c_{k_1}, c_{k_2}, ..., c_{k_m}\}$. Create $m$ 
positive examples ($y_i = 1$) whose observation is $x_i = 1_{\{k_i\}}$ (the all-0 vector with only one “1” in 
coordinate $k_i$). Clearly, the sparsity constraint on examples (125) is satisfied. We craft the rados 
following $n$ Rademacher assignations, where $\sigma_j$ is $+1$ only for an example $x_i$ such that $s_j \in c_{k_i}$, and $-1$ otherwise. Notice that 

$$\pi_{\sigma_j} = 1_{\{k_i\}} , \qquad (127)$$

$$\pi_j = 1_{\{k : s_j \in c_k\}} . \qquad (128)$$

It comes 

$$\|\pi_j - \pi_{\sigma_j}\|_p \leq 2^{1/p} = r , \quad \forall j \in [n] , \qquad (129)$$

if $p \neq 0$, and 

$$\|\pi_j - \pi_{\sigma_j}\|_0 \leq 2 = r , \quad \forall j \in [n] , \qquad (130)$$

otherwise, since each element of $S$ belongs to three sets in $C$. Therefore, there exists a solution to 
Sparse-Approximation. 
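
To illustrate this direction of the reduction, the sketch below starts from an exact cover, builds one positive example per covering subset, and evaluates the rado approximation error of (126) for several values of p. It assumes the definition of a rado from the main text, $\pi_\sigma = \frac{1}{2} \sum_i (\sigma_i + y_i) x_i$, which is not restated in this section; the instance, the cover and all helper names are hypothetical.

```python
from typing import List, Set


def indicator(coords: Set[int], d: int) -> List[int]:
    """All-zero vector of size d with ones in the given coordinates."""
    return [1 if k in coords else 0 for k in range(d)]


def rado(sigma: List[int], xs: List[List[int]], ys: List[int]) -> List[float]:
    """Assumed definition: pi_sigma = 1/2 * sum_i (sigma_i + y_i) * x_i."""
    d = len(xs[0])
    return [0.5 * sum((s + y) * x[k] for s, y, x in zip(sigma, ys, xs))
            for k in range(d)]


def p_norm(v: List[float], p: float) -> float:
    """L_p norm; p = 0 counts the non-zero coordinates."""
    if p == 0:
        return float(sum(1 for a in v if a != 0))
    return sum(abs(a) ** p for a in v) ** (1.0 / p)


# Hypothetical instance: each of the 6 elements lies in exactly three subsets.
C = [{0, 1, 2}, {3, 4, 5}, {0, 1, 3}, {0, 2, 4}, {1, 3, 5}, {2, 4, 5}]
cover = [0, 1]                      # indices of an exact cover of {0, ..., 5}
d, n = len(C), 6
xs = [indicator({k}, d) for k in cover]
ys = [1] * len(cover)               # all examples are positive

for p in (0, 1, 2):
    r = 2.0 if p == 0 else 2.0 ** (1.0 / p)
    for j in range(n):
        pi_j = indicator({k for k in range(d) if j in C[k]}, d)
        sigma_j = [1 if j in C[k_i] else -1 for k_i in cover]
        pi_sigma = rado(sigma_j, xs, ys)
        err = p_norm([a - b for a, b in zip(pi_j, pi_sigma)], p)
        assert err <= r + 1e-12     # constraint (126) holds
```

In this instance the difference between the two rados always has exactly two non-zero coordinates, which is where the value of r comes from.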

Now, suppose there exists a solution to Sparse-Approximation. Remark that we can remove 
wlog any example having null observation as this does not change the feasibility of the solution. 

Consider the case where $p \neq 0$. The Rado approximation constraint (126) of Sparse-Approximation 
implies that the following property (P) is satisfied: 

(P) for each $j \in [n]$, there exists $i \in [m]$ and feature $k \in [d]$ such that $\pi_{\sigma_j}$ and example $x_i$ have 
their coordinate $k$ non-zero, and furthermore the coordinate in $x_i$ has magnitude exactly $\ell$: it 
cannot be less otherwise (126) is violated, and it cannot be more otherwise (125) is violated. 
Hence, each of these $x_i$ has exactly one non-zero coordinate. 

Because property (P) holds for all rados, we see that the corresponding indexes in the $x_i$ (the 
corresponding non-zero coordinates for features for which (P) holds; there cannot be more than 
$m$) define a solution to X3C3. The case $p = 0$ is easier as (125) enforces the number of non-zero 
coordinates in each observation to be at most one, and therefore exactly one since there is no null 
observation. 

We finally note that Sparse-Approximation trivially belongs to NP, so it is actually NP-Complete. 


10.10 Proof of Lemma 

We make the same reduction as for Sparse-Approximation. The set of examples S consists of all 
canonical basis vectors, each associated with the positive class. 


11 Appendix — Experiments 

11.1 Supplementary experiments to Table 

Domain       m           d    100σ   AdaBoost*          AdaBoost(n)*       RadoBoost*         p      p'
                                     err±σ        Y/N   err±σ        Y/N   err±σ        Y/N
Fertility    100         9    -      44.00±18.38  Y     57.00±17.03  N     53.00±14.18  —     0.28   0.42
Haberman     306         3    -      25.78±4.78   N     41.88±12.38  N     25.77±6.04   Y     0.98   ε
Transfusion  748         4    -      39.19±6.66   Y     36.78±5.76   Y     36.65±5.74   Y     0.04   0.95
Banknote     1 372       4    -      2.70±1.38    Y     2.70±1.38    N     13.93±3.68   Y     ε      ε
Breastwisc   699         9    -      2.86±1.90    Y     4.43±2.07    N     3.58±1.69    Y     0.24   0.14
Ionosphere   351         33   -      11.92±7.03   N     11.37±4.94   Y     17.07±9.26   N     0.05   0.03
Sonar        208         60   -      25.60±11.41  Y     30.36±10.46  N     27.02±12.77  Y     0.51   0.43
Wine-red*    1 599       11   1      26.33±4.00   N     25.95±4.01   Y     27.70±3.39   Y     0.05   0.03
Abalone*     4 177       8    -      25.59±2.59   N     25.45±2.74   N     24.80±2.59   Y     0.18   0.07
Wine-white*  4 898       11   1      31.07±2.10   N     30.54±2.06   N     33.42±2.38   N     ε      ε
Magic*       19 020      10   -      21.18±1.16   N     21.23±1.34   N     22.90±2.19   N     ε      ε
Eeg          14 980      14   14     43.54±1.67   Y     43.06±2.35   Y     43.73±1.89   Y     0.67   0.09
Hardware*    28 179      95   -      3.01±0.27    Y     2.70±0.39    Y     7.35±3.31    Y     ε      ε
Twitter*     583 250     77   44     6.08±0.15    Y     6.72±0.64    Y     5.71±0.64    Y     0.07   ε
SuSy         5 000 000   17   -      28.17±0.03   N     27.92±1.40   N     27.14±0.39   Y     ε      0.13
Higgs        11 000 000  28   -      46.20±0.05   N     47.68±0.55   N     47.86±0.06   —     ε      0.34

Table 3: Comparison of RadoBoost to AdaBoost ([27]) and AdaBoost trained with a random 
subset of training examples of the same size as S* (AdaBoost(n)). The symbol * next to an algorithm 
name indicates that it is run with the replacement of eq. (131) for the normalized edge. Conventions 
are the same as in Table. The symbols Y, N, —, respectively indicate whether the new version performs 
better than (resp. worse than, similarly to) the non-modified version. 

Table 3 is obtained under the same experimental setting as that of Table, with an important 
modification in how the normalized edge is computed. More specifically, the computation of $r_t$ in 
Step 2.2 of RadoBoost is completed by the following step: 

$$r_t \leftarrow \mathrm{sign}(r_t) \cdot \max\{0.1, |r_t|\} \qquad (131)$$

The same modification is also carried out in AdaBoost ([27]) (Corollary 1). This aims to prevent 
domains with outlier feature values from tricking AdaBoost into picking the wrong sign 
for $\alpha_t$ for a large number of iterations, due to values of $r_t$ with a very small magnitude (but with the 
wrong sign). Experiments display that this corrects AdaBoost’s bad results on Twitter, but on 
other domains like Fertility, Haberman, Sonar, Abalone, the change happens to give worse results 
for AdaBoost and/or AdaBoost(n). RadoBoost’s results, on the other hand, tend to improve 
with sparse exceptions. 
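
The clamping step of eq. (131) can be sketched in a few lines. The function name `clamp_edge` and the treatment of an edge exactly equal to zero (mapped to +0.1 here) are assumptions; the text only specifies the sign-preserving floor of 0.1 on the magnitude.

```python
import math


def clamp_edge(r_t: float, floor: float = 0.1) -> float:
    """Eq. (131): keep the sign of r_t, force its magnitude to be >= floor.

    math.copysign treats 0.0 as positive, so a zero edge becomes +floor;
    this tie-breaking rule is an assumption, not stated in the text.
    """
    sign = math.copysign(1.0, r_t)
    return sign * max(floor, abs(r_t))


for r in (0.03, -0.03, 0.5, -0.7):
    print(r, "->", clamp_edge(r))
```

Edges whose magnitude already exceeds the floor are left unchanged, so the modification only affects rounds where the normalized edge is close to zero.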


11.2 Supplementary experiments to Section — I / III 

Tables 4 to 7 present results comparing AdaBoost, RadoBoost with random rados and 
RadoBoost with fixed support size rados (m*). Unless otherwise stated in the Tables, the following 
experimental setup holds: 


• RadoBoost is trained with n = min{1000, train fold size/2} rados; 

• AdaBoost is trained using the complete training fold; 


• for each standard deviation σ, we generate 10 noisy domains; each is then processed following 
10-fold stratified cross-validation. Thus, each dot on the colored curves is the average of ten 
experiments; 


• RadoBoost is trained with two types of rados: random rados (this gives the 
grey dashed curves), or rados with fixed support m* (noted s on the plots) as in Subsection 
5.2 (this gives the colored curves). 
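
The setup above can be summarized by a short sketch that computes the number of rados for a training fold and generates noisy copies of a domain. The helper names, the use of integer division for "train fold size/2", the seed, and the choice to add the Gaussian noise to every feature value of the observations are assumptions made for illustration.

```python
import random
from typing import List


def num_rados(train_fold_size: int) -> int:
    """n = min{1000, train fold size / 2}; integer division is an assumption."""
    return min(1000, train_fold_size // 2)


def noisify(X: List[List[float]], sigma: float,
            rng: random.Random) -> List[List[float]]:
    """Add independent N(0, sigma^2) noise to every feature value."""
    return [[v + rng.gauss(0.0, sigma) for v in row] for row in X]


def noisy_domains(X: List[List[float]], sigma: float,
                  copies: int = 10, seed: int = 0) -> List[List[List[float]]]:
    """Generate `copies` independently noisified versions of the domain."""
    rng = random.Random(seed)
    return [noisify(X, sigma, rng) for _ in range(copies)]


X = [[0.0, 1.0], [2.0, 3.0]]       # toy observations, not from the paper
domains = noisy_domains(X, sigma=1.0)
print(len(domains), num_rados(100), num_rados(5000))
```

Each noisy domain would then go through its own stratified cross-validation, which is omitted here.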
11.3 Supplementary experiments to Section — II / III 

Tables 8 and 9 compare RadoBoost trained with rados of fixed support and using a “prudential” 
weak learner (which picks the median feature according to $|r_t|$), to RadoBoost trained with 
plain random rados and using the “strongest” possible weak learner which picks the best feature 
according to $|r_t|$. 
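
The two feature-selection rules compared here can be sketched as follows. The function names and the tie-breaking choices (first maximum, upper median when the number of features is even) are assumptions; the text only states that one learner picks the best feature and the other the median feature according to $|r_t|$.

```python
from typing import Sequence


def strongest_feature(edges: Sequence[float]) -> int:
    """Index of the feature with the largest |r_t|."""
    return max(range(len(edges)), key=lambda k: abs(edges[k]))


def median_prudential_feature(edges: Sequence[float]) -> int:
    """Index of the feature whose |r_t| is the median of all |r_t| values.

    For an even number of features the upper median is used (assumption).
    """
    order = sorted(range(len(edges)), key=lambda k: abs(edges[k]))
    return order[len(order) // 2]


# Toy normalized edges, one per feature (not from the paper).
edges = [0.05, -0.40, 0.20, -0.10, 0.30]
print(strongest_feature(edges), median_prudential_feature(edges))
```

On the toy edges, the strongest rule selects the feature with edge -0.40 and the median rule the feature with edge 0.20.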

11.4 Supplementary experiments to Section — III / III 

Tables 10 and 11 compare two different rado generation mechanisms with respect to RadoBoost: 
the random generation of arbitrary rados, and the random generation of rados with 
fixed support (Subsection 5.2). In both Tables, the weak learner is always the same (contrary to 
Tables 8 and 9), i.e. the “strong” weak learner that picks the best feature according to $|r_t|$, at each 
iteration. 
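
The two generation mechanisms can be contrasted with a minimal sketch. It assumes the definition of a rado from the main text, $\pi_\sigma = \frac{1}{2} \sum_i (\sigma_i + y_i) x_i$, and it reads "fixed support" as: exactly m* examples have $\sigma_i = y_i$ and therefore contribute to the rado. Both points, as well as the helper names and the toy data, are assumptions and not a restatement of the mechanism of Subsection 5.2.

```python
import random
from typing import List


def rado(sigma: List[int], X: List[List[float]], y: List[int]) -> List[float]:
    """Assumed definition: pi_sigma = 1/2 * sum_i (sigma_i + y_i) * x_i."""
    d = len(X[0])
    return [0.5 * sum((s + yi) * x[k] for s, yi, x in zip(sigma, y, X))
            for k in range(d)]


def random_signature(m: int, rng: random.Random) -> List[int]:
    """Arbitrary rado: each sigma_i drawn uniformly from {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(m)]


def fixed_support_signature(y: List[int], support_size: int,
                            rng: random.Random) -> List[int]:
    """Exactly `support_size` examples get sigma_i = y_i, the others -y_i."""
    chosen = set(rng.sample(range(len(y)), support_size))
    return [yi if i in chosen else -yi for i, yi in enumerate(y)]


# Toy data, not from the paper.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
y = [1, -1, 1, 1]
rng = random.Random(0)

sig = fixed_support_signature(y, support_size=2, rng=rng)
print(rado(random_signature(len(y), rng), X, y), rado(sig, X, y))
```

With a random signature the number of contributing examples varies from one rado to the next, whereas the fixed-support variant keeps it constant.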





[Plots not reproduced. Panel columns: WFI = Strong and WFI = Median-prudential; horizontal axis: σ (Gaussian).] 

Table 4: Learning from examples that have been noisified using the Gaussian mechanism $N(0, \sigma^2 I)$ 
(see Section 10.7), as a function of σ. In each plot, the right axis gives AdaBoost’s ([27]) test 
error, related to the big dotted curve. All other curves are related to the left axis, which gives the 
difference of test errors (Δperr) between RadoBoost and AdaBoost. The grey dashed curve is 
for rados picked uniformly at random. The colored curves (green, red, 
blue) correspond to rados with fixed support s (= m*) such that s/m ∈ {0.25, 0.5, 0.75}, generated 
with the mechanism of Section 5.2. m refers to the size of a training fold. The range of σ is not the 
same on the left and right plots. The horizontal dashed black line indicates Δperr = 0: colored 
lines below this line indicate runs of RadoBoost that are better than AdaBoost’s. 

Table 5: Learning from examples that have been noisified using the Gaussian mechanism $N(0, \sigma^2 I)$ 
(see Section 10.7), as a function of σ. Conventions follow Table 4. 

[Plots not reproduced. Panel columns: WFI = Strong and WFI = Median-prudential; horizontal axis: σ (Gaussian), with ticks at powers of ten; legend: Δperr (Rado: s/m = 0.25), Δperr (Rado: s/m = 0.50), Δperr (Rado: s/m = 0.75), Δperr (Rado: random), perr AdaBoost.] 

Table 6: Learning from examples that have been noisified using the Gaussian mechanism $N(0, \sigma^2 I)$ 
(see Section 10.7), as a function of σ. Conventions follow Table 4. 



[Plots not reproduced. Panel columns: WFI = Strong and WFI = Median-prudential; horizontal axis: σ (Gaussian).] 

Table 7: Learning from examples that have been noisified using the Gaussian mechanism $N(0, \sigma^2 I)$ 
(see Section 10.7), as a function of σ. Conventions follow Table 4. 


[Plots not reproduced. Panels: Fertility, Haberman, Transfusion, Banknote, Breastwisc, Ionosphere.] 

Table 8: Test error of RadoBoost trained with rados with fixed support and Median-prudential 
weak learner (Subsection 5.2), minus test error of RadoBoost trained with random rados and the 
“Strong” weak learner (i.e. the one that picks the best feature at each iteration), as 
a function of the Gaussian mechanism’s standard deviation σ. The horizontal dashed line corresponds 
to Δperr = 0. Points below this line denote better performances for the rados with fixed support 
and with the prudential weak learner. s is the support size (m relates to the size of the training 
fold), for three values, s/m = 0.25 (green), s/m = 0.5 (red) and s/m = 0.75 (blue). 


[Plots not reproduced. Panels: Sonar, Wine-red, Abalone, Wine-white, Magic, Eeg.] 

Table 9: Test error of RadoBoost trained with rados with fixed support and Median-prudential 
weak learner, minus test error of RadoBoost trained with random rados and the “Strong” weak 
learner. Conventions follow Table 8. 


[Plots not reproduced. Panels: Fertility, Haberman, Breastwisc, Ionosphere.] 

Table 10: Test error of RadoBoost trained with rados with fixed support minus test error of 
RadoBoost trained with plain random rados, as a function of the Gaussian mechanism’s standard 
deviation σ. Points below the Δperr = 0 line indicate smaller errors for the training with rados 
of fixed support. s is the support size (m relates to the size of the training fold), for three values, 
s/m = 0.25 (green), s/m = 0.5 (red) and s/m = 0.75 (blue). 


[Plots not reproduced. Panels: Sonar, Wine-red, Abalone, Wine-white, Magic, Eeg.] 

Table 11: Test error of RadoBoost trained with rados with fixed support minus test error of 
RadoBoost trained with plain random rados (continued). Conventions follow Table 10. 